Skill

research-question

Transforms vague research ideas into falsifiable, prereg-ready hypotheses with rivals and disconfirming predictions. Use before any benchmark, eval, or training work begins.

ai-ml

Slash command

/epistemic-skills:research-question

SKILL.md

532 lines · ~6.5k tokens (exceeds 5k compaction limit)

Stats

Language: TypeScript

Stars: 7

Maintenance: Excellent

Last Commit: Jun 4, 2026

Research Question

Overview

A vague idea is not a hypothesis. A favorite explanation is not research.

This phase exists to stop the oldest failure mode in experimental work: seeing one interesting observation, falling in love with the first story that explains it, and then building the whole experiment around that story.

In this repo, the output of this phase is:

one OPEN draft in HYPOTHESES.md
zero code runs
zero judge-lock files
zero prereg files
a paper trail of the non-chosen explanations in experiments/{id}/alternatives/

Current repo reality matters. src/state/repo.ts already persists bestCaseConclusion and computeTarget on HypothesisEntry. Use the real shape. Do not invent a smaller checklist because an older note called it "seven fields."

One question at a time. One ambiguity killed at a time. No giant intake form. No favorite theory disguised as inevitability.

The Iron Law

NO HYPOTHESIS WITHOUT A RIVAL AND A WAY TO LOSE

If the observation only has one explanation, you are not done. If the explanations do not have distinct disconfirming predictions, you are not done. If the claim still cannot be falsified mechanically, you are not done. If the compute target is still "we'll figure it out later," the cost and runtime story is still fiction.

Do not move to preregistration with a claim that only sounds disciplined because the wording got longer.

When to Use

Use this skill when:

a researcher says "maybe X is better", "I think Y helped", or "should we test whether this is more robust?"
you have a surprising observation but no clean causal story yet
the comparator is still implied
the metric is still vague
the falsifier is still philosophical
the baseline is still folklore instead of a named source
sample size is still hand-wavy
cost cap depends on hope
compute target is still unstated
you need a prereg-ready draft in HYPOTHESES.md

When NOT to Use

Do not use this skill when:

experiments/{id}/prereg.md already exists and the contract is being locked; use /skill:preregistration
you are reproducing an external or published comparator; use /skill:baseline-reproduction
you are about to run code, evals, training, benchmarks, or judge-backed scoring
you already have numbers and want to retrofit a cleaner question around them
you are deciding whether to kill, recommit, or ship
the repo scaffold is missing and HYPOTHESES.md does not exist yet
you are tempted to ask for every field in one message and call that rigor

Working Surface

| Surface | API or file | Why it matters now |
| --- | --- | --- |
| Repo sanity | `loadRepoState(cwd)` | Confirm the epistemic scaffold exists before assuming normal flow |
| Canonical registry | `HYPOTHESES.md` | Final output of this stage |
| Read live ideas | `loadHypotheses(cwd)` | Avoid overwriting active work |
| Parse registry text | `parseHypotheses(content)` | Use only if raw markdown is already loaded |
| Detect live work | `getActiveHypothesis(entries)` | Decide whether you are refining an existing idea or minting a new `id` |
| Render and persist | `hypothesisToMarkdown(entry)`, `saveHypotheses(cwd, entries)` | Preserve the real repo format, including `bestCaseConclusion` and `computeTarget` |
| Baseline context | `BASELINES.md`, `loadBaselines(cwd)` | Name the comparator precisely instead of vaguely |
| Baseline freshness | `getBaselineAgeDays(entry)` | Detect stale references before you lean on them |
| Prior spend | `getHypothesisSpend(cwd, id)`, `getAllHypothesisSpends(cwd)` | Ground the budget in actual burn instead of vibes |
| Cost ledger | `.epistemic/cost-ledger.jsonl` | Planned here, filled later |
| Alternative archive | `experiments/{id}/alternatives/` | Preserve rejected explanations instead of letting them vanish |
| Later prereg artifact | `experiments/{id}/prereg.md` | Next phase only; do not write it here |

The Prereg-Ready Checklist

Legacy shorthand called this the "7-field" checklist. That shorthand is stale.

In this repo, a prereg-ready hypothesis has eight required research fields because bestCaseConclusion and computeTarget are first-class parts of the contract:

claim — one measurable sentence with an intervention, comparator, metric, and context.
falsifier — one empirical sentence answering: What would disprove this?
bestCaseConclusion — one-sentence low-expectations framing: the strongest boring conclusion you are allowed to write if the experiment succeeds.
n — the sample size: prompts, seeds, tasks, or runs.
judgeRef — the exact judge configuration you expect to lock later.
baselineRef — the comparator name plus enough provenance to reproduce it later.
costCap — the maximum USD the idea is allowed to consume before a real decision point.
computeTarget — where the experiment will run: local, docker, or modal.

Housekeeping fields still matter: id, status, and timestamp.

Do not confuse "the registry can store it" with "the hypothesis is scientifically complete." The checklist is the research contract. The markdown shape is only the transport.
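
The checklist can be sketched as a type plus a completeness check. This is a minimal sketch under stated assumptions: `DraftHypothesis` and `missingResearchFields` are hypothetical stand-ins, not the real `HypothesisEntry` from src/state/repo.ts, and the field types are guessed from the example values used elsewhere in this document.

```typescript
// Hypothetical stand-in for the repo's HypothesisEntry; field types are assumptions.
type ComputeTarget = "local" | "docker" | "modal";

interface DraftHypothesis {
  claim?: string;
  falsifier?: string;
  bestCaseConclusion?: string;
  n?: number;
  judgeRef?: string;
  baselineRef?: string;
  costCap?: number;
  computeTarget?: ComputeTarget;
}

// The eight research fields, in checklist order.
const RESEARCH_FIELDS = [
  "claim",
  "falsifier",
  "bestCaseConclusion",
  "n",
  "judgeRef",
  "baselineRef",
  "costCap",
  "computeTarget",
] as const;

// Returns the research fields that are still missing, blank, or non-positive.
function missingResearchFields(draft: DraftHypothesis): string[] {
  return RESEARCH_FIELDS.filter((field) => {
    const value = draft[field];
    if (value === undefined) return true;
    if (typeof value === "string") return value.trim() === "";
    // Remaining case is a number (n or costCap): it must be finite and positive.
    return !isFinite(value) || value <= 0;
  });
}
```

An empty result only means every slot is filled. It says nothing about whether the contents are specific enough, which is what the rest of the process checks.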

Competing Hypotheses

This phase comes before you settle on one claim.

Start from one observation. Generate 2-3 competing explanations for that same observation. Each explanation must have its own unique disconfirming prediction.

If two explanations die under the same killer test, they are not meaningfully separated yet. Tighten them or delete one.

Use a table like this:

| Explanation | Unique disconfirming prediction | Falsifiability score (1-5) | Cost to test | Prior plausibility (1-5) |
| --- | --- | --- | --- | --- |
| A | What result would kill A specifically? | 1-5 | dollars or low/med/high | 1-5 |
| B | What result would kill B specifically? | 1-5 | dollars or low/med/high | 1-5 |
| C | What result would kill C specifically? | 1-5 | dollars or low/med/high | 1-5 |

Ranking rule:

Prefer higher falsifiability
Then prefer lower cost to test
Then prefer higher prior plausibility

Do not let "it feels most likely" outrank "it is easiest to kill." Research gets cleaner when the chosen explanation is cheap to defeat.
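
The ranking rule is a three-key sort. The sketch below assumes cost is expressed in dollars rather than low/med/high; the `Explanation` shape and the short names are hypothetical, and the scores are the ones from the GSM8K example in Good vs Bad.

```typescript
// Hypothetical shape for one row of the ranking table.
interface Explanation {
  name: string;
  falsifiability: number; // 1-5, higher means easier to kill
  costUsd: number; // estimated cost to test, in dollars
  plausibility: number; // 1-5 prior plausibility
}

// Sort by falsifiability (descending), then cost (ascending),
// then prior plausibility (descending). The input array is not mutated.
function rankExplanations(rows: Explanation[]): Explanation[] {
  return [...rows].sort(
    (a, b) =>
      b.falsifiability - a.falsifiability ||
      a.costUsd - b.costUsd ||
      b.plausibility - a.plausibility
  );
}

const ranked = rankExplanations([
  { name: "reasoning-scaffold", falsifiability: 5, costUsd: 0.6, plausibility: 4 },
  { name: "judge-verbosity", falsifiability: 5, costUsd: 0.4, plausibility: 3 },
  { name: "template-leak", falsifiability: 4, costUsd: 0.8, plausibility: 2 },
]);
console.log(ranked.map((row) => row.name));
```

The sort only orders the proposal. The researcher still picks the winner explicitly, and may pick something other than the top row.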

Workflow:

The coding agent proposes the 2-3 ranked explanations.
The researcher picks one explicit winner.
The chosen explanation becomes the claim candidate.
The non-chosen explanations are written to experiments/{id}/alternatives/.
Each alternative note records:
- the observation
- the alternative explanation
- its unique disconfirming prediction
- falsifiability score
- cost to test
- prior plausibility
- why it was not chosen now

Unchosen explanations are not trash. They are live audit material. If the chosen claim dies later, that archive is where honest follow-up starts.
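
One way to keep alternative notes uniform is to render them from a record. This is a sketch, not the repo's implementation: `AlternativeNote`, `renderAlternativeNote`, and the file-naming scheme in `alternativePath` are all assumptions. Only the experiments/{id}/alternatives/ directory and the note fields come from this document.

```typescript
// Hypothetical record for one non-chosen explanation.
interface AlternativeNote {
  index: number;
  slug: string;
  observation: string;
  explanation: string;
  disconfirmingPrediction: string;
  falsifiability: number; // 1-5
  costToTest: string; // e.g. "$0.40" or "low"
  plausibility: number; // 1-5
  whyNotChosen: string;
}

// Zero-pad single-digit indexes so notes sort in order on disk.
function twoDigits(index: number): string {
  return index < 10 ? "0" + index : String(index);
}

// Renders the note as markdown, one bullet per recorded field.
function renderAlternativeNote(note: AlternativeNote): string {
  return [
    `# Alternative ${twoDigits(note.index)}: ${note.slug}`,
    "",
    `- Observation: ${note.observation}`,
    `- Explanation: ${note.explanation}`,
    `- Unique disconfirming prediction: ${note.disconfirmingPrediction}`,
    `- Falsifiability score: ${note.falsifiability}/5`,
    `- Cost to test: ${note.costToTest}`,
    `- Prior plausibility: ${note.plausibility}/5`,
    `- Not chosen now because: ${note.whyNotChosen}`,
  ].join("\n");
}

// Assumed naming scheme; the document only fixes the directory.
function alternativePath(id: string, note: AlternativeNote): string {
  return `experiments/${id}/alternatives/${twoDigits(note.index)}-${note.slug}.md`;
}
```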

The Process

1. Resolve the repository context before asking content questions

Start from repo state, not memory.
Use loadRepoState(cwd) if you need a top-level scaffold check.
Use loadHypotheses(cwd) to inspect HYPOTHESES.md.
If raw markdown is already in hand, use parseHypotheses(content) instead of improvising a parser.
Use getActiveHypothesis(entries) to see whether one idea is already OPEN or RUNNING.
Decide whether the current conversation refines that exact idea or deserves a new id.
If the new idea is materially different, mint a new id before you archive alternatives.
Do not silently overwrite a live record because the new wording sounds cleaner.
If the scaffold is missing, fix that first instead of pretending normal flow exists.
Only once you know which record you are touching do you start the questioning loop.
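
The refine-or-mint decision in this step can be sketched as a small pure function. Everything here is a local stand-in: `RegistryEntry` is a cut-down hypothetical view of a registry entry, and the function does not call the repo's `getActiveHypothesis`. It only mirrors the rule that OPEN or RUNNING work must not be overwritten.

```typescript
// Hypothetical minimal view of a registry entry; the real entry has more fields.
interface RegistryEntry {
  id: string;
  status: string; // this sketch treats "OPEN" and "RUNNING" as live
}

type Target =
  | { action: "refine"; id: string }
  | { action: "mint"; id: string };

// Decide which record the conversation is allowed to touch.
// `sameIdea` is the researcher's explicit answer, never an inference.
function resolveTarget(
  entries: RegistryEntry[],
  proposedId: string,
  sameIdea: boolean
): Target {
  const live = entries.filter(
    (entry) => entry.status === "OPEN" || entry.status === "RUNNING"
  );
  if (live.length > 0 && sameIdea) {
    return { action: "refine", id: live[0].id };
  }
  // A materially different idea needs a fresh id; never reuse an existing one.
  if (entries.some((entry) => entry.id === proposedId)) {
    throw new Error(`id "${proposedId}" already exists; refusing to overwrite`);
  }
  return { action: "mint", id: proposedId };
}
```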

2. Ask one question at a time from the observation outward

Ask one question.
Wait for one answer.
Compress that answer into one resolved ambiguity.
Ask the next question only after the previous ambiguity is actually closed.
Prefer multiple choice when it reduces drift.
Use open questions when the field itself is still undefined.
Never ask for all eight fields in one blast.
Giant intake forms create placeholder answers and hidden contradictions.
Start from the observed fact: what changed, where, against what, and how do we know?
Do not let the first explanation sneak in as if it were already the claim.
If the answer bundles multiple observations, split them before continuing.
Progress is not measured by how much text was exchanged. Progress is measured by how much ambiguity died.

3. Build competing hypotheses before you bless one story

Take the same observation and generate 2-3 plausible explanations for it.
Each explanation must explain the same observed fact, not a different fact.
Each explanation must come with one unique disconfirming prediction.
Write the ranking table explicitly: falsifiability score, cost to test, prior plausibility.
Rank the explanations using the rule above: more falsifiable, then cheaper, then more plausible.
Present the ranked set to the researcher.
The researcher picks one.
Record the choice explicitly. Do not silently choose by omission.
Write the non-chosen explanations to experiments/{id}/alternatives/ immediately.
If you cannot produce at least two plausible rivals, you do not understand the observation well enough yet.
If one explanation only survives because its disconfirming prediction is vague, lower its rank.
If an explanation cannot be killed by a practical experiment, it is weak no matter how elegant it sounds.
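
A mechanical first pass on rival separation might look like the sketch below. It is a crude heuristic under stated assumptions: the `Rival` shape is hypothetical, and the check only catches predictions that are textually identical after normalization. Two differently worded predictions that fail under the same experiment still need human judgment.

```typescript
// Hypothetical shape: one rival explanation and its killer prediction.
interface Rival {
  name: string;
  disconfirmingPrediction: string;
}

// Lowercase and collapse punctuation so trivial rewordings still match.
function normalizePrediction(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Lists the problems that block moving past this step.
function separationProblems(rivals: Rival[]): string[] {
  const problems: string[] = [];
  if (rivals.length < 2) {
    problems.push("need at least two plausible rivals");
  }
  for (let i = 0; i < rivals.length; i++) {
    if (normalizePrediction(rivals[i].disconfirmingPrediction) === "") {
      problems.push(`${rivals[i].name} has no disconfirming prediction`);
    }
    for (let j = i + 1; j < rivals.length; j++) {
      const same =
        normalizePrediction(rivals[i].disconfirmingPrediction) ===
        normalizePrediction(rivals[j].disconfirmingPrediction);
      if (same) {
        problems.push(
          `${rivals[i].name} and ${rivals[j].name} share the same disconfirming prediction`
        );
      }
    }
  }
  return problems;
}
```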

4. Lock the chosen claim down until it is measurable

A claim is not a slogan.
Force four concrete parts: intervention, comparator, metric, and context.
Ask what exactly is changing.
Ask what exactly it is being compared against.
Ask which metric decides the winner.
Ask where that metric will be measured: benchmark, dataset, task slice, or workload.
Ask for directionality: higher, lower, faster, cheaper, more accurate.
If a threshold matters, ask for it now.
Reject words like better, smarter, more robust, or more aligned when no metric is named.
Reject bundled claims that one experiment cannot falsify.
Reject claims that rely on an implied comparator.
Keep tightening until the claim can later live in one sentence in RESULTS.md without extra storytelling.
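
The four-part requirement and the vague-word rule can be checked mechanically before a human reads the claim. The sketch below is illustrative only: `ClaimParts` and `claimProblems` are hypothetical names, and the vague-term match is a plain substring test on the words named in this step.

```typescript
// Vague terms named in this step; each needs a metric before it means anything.
const VAGUE_TERMS = ["better", "smarter", "more robust", "more aligned"];

// The four concrete parts every claim must carry.
interface ClaimParts {
  intervention: string;
  comparator: string;
  metric: string;
  context: string;
}

const PART_NAMES = ["intervention", "comparator", "metric", "context"] as const;

// Returns reasons the claim is not yet measurable; empty means it passes this lint.
function claimProblems(claim: string, parts: Partial<ClaimParts>): string[] {
  const problems: string[] = [];
  for (const name of PART_NAMES) {
    const value = parts[name];
    if (value === undefined || value.trim() === "") {
      problems.push(`missing ${name}`);
    }
  }
  const metricNamed = parts.metric !== undefined && parts.metric.trim() !== "";
  if (!metricNamed) {
    const lower = claim.toLowerCase();
    for (const term of VAGUE_TERMS) {
      if (lower.indexOf(term) !== -1) {
        problems.push(`"${term}" used without a named metric`);
      }
    }
  }
  return problems;
}
```

Passing this lint is necessary, not sufficient. A claim can name all four parts and still bundle two effects that one experiment cannot separate.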

5. Run the falsifiability test hard

Ask the question directly: What would disprove this?
If the answer is not empirical, stop.
A valid falsifier must be observable from outputs, scores, logs, or other measurable artifacts.
A valid falsifier must be reachable by the planned experiment.
Reject answers about intent, elegance, vibes, philosophy, or metaphysics.
Reject moving-goal clauses like "unless the seed was weird" or "unless the judge missed nuance."
Force the falsifier into one sentence a hostile reviewer could apply mechanically.
Good form: "If metric M does not exceed comparator C by threshold T under condition K across n runs, the claim is falsified."
If the falsifier cannot be written that cleanly, the claim is still vague.
Fix the claim, then ask again.
Unfalsifiable ideas do not get into HYPOTHESES.md.
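
The "good form" sentence above is a template with five slots, which makes it easy to generate and to lint. In this sketch `FalsifierSpec`, `buildFalsifier`, and `hasEscapeClause` are hypothetical names, and the escape-clause check looks only for the word "unless", taken from the two moving-goal examples in this step.

```typescript
// The five slots of the good-form falsifier sentence.
interface FalsifierSpec {
  metric: string; // M
  comparator: string; // C
  threshold: string; // T, with units, e.g. "2 points"
  condition: string; // K
  n: number; // number of runs
}

// Fills the template so a hostile reviewer can apply it mechanically.
function buildFalsifier(spec: FalsifierSpec): string {
  return (
    `If ${spec.metric} does not exceed ${spec.comparator} by ${spec.threshold} ` +
    `under ${spec.condition} across ${spec.n} runs, the claim is falsified.`
  );
}

// Flags moving-goal phrasing such as "unless the seed was weird".
function hasEscapeClause(falsifier: string): boolean {
  return /\bunless\b/i.test(falsifier);
}
```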

6. Choose sample size before any number exists

Define what one observation means: prompt, task, seed, batch, or full run.
Persist that count as n.
Reject n = TBD.
Reject "until stable."
Reject "we'll start small and see."
Match n to the falsifier.
If the falsifier is about mean improvement, n must support a mean.
If it is about win rate, n must support a win rate.
If the process is stochastic, n = 1 is usually theater.
If the process is deterministic, ask why repetition is unnecessary.
If the declared n does not fit the likely budget, narrow the claim instead of pretending the sample is enough.
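
Rejecting placeholder sample sizes is easy to automate. A minimal sketch, assuming `n` arrives as free text or a number from the conversation; `parseSampleSize` and `sampleSizeWarnings` are hypothetical helpers, not repo APIs.

```typescript
// Accepts only a positive whole number; "TBD", "until stable", and blanks all throw.
function parseSampleSize(raw: string | number): number {
  const value = typeof raw === "number" ? raw : Number(raw.trim());
  const isWhole = isFinite(value) && Math.floor(value) === value;
  if (!isWhole || value < 1) {
    throw new Error(`n must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Soft checks that deserve one more question rather than a hard rejection.
function sampleSizeWarnings(n: number, stochastic: boolean): string[] {
  const warnings: string[] = [];
  if (stochastic && n === 1) {
    warnings.push("stochastic process with n = 1: one run cannot support a mean or a win rate");
  }
  return warnings;
}
```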

7. Capture judge, baseline, compute target, and budget with enough precision to survive preregistration

For the judge, require exact leaves: model, prompt, temperature, and seed.
Reject latest, default, current prompt, and other drifting placeholders.
Once those leaves are known, compress them into one stable judgeRef.
For the baseline, require a name, source URL, quoted score when known, version when known, and retrieval date when known.
Use loadBaselines(cwd) to see whether the comparator already exists in BASELINES.md.
If it exists, inspect freshness with getBaselineAgeDays(entry).
If the baseline is stale, say so immediately instead of hiding the problem for later.
Then ask the compute question exactly: "Where will experiments run? local, Docker, or Modal?"
Record the answer as computeTarget.
Explain the trade-offs plainly:
- local — fastest feedback loop, lowest setup overhead, easiest interactive debugging, highest environment drift risk
- docker — slower upfront, better dependency and OS pinning, better handoff and reproducibility, usually the safest default when environment matters
- modal — best for remote parallelism or heavier managed compute, but adds orchestration, secrets, cold starts, and extra spend; not a free default
Reject "we'll start local and decide later" unless that environment switch is part of the registered design.
Estimate cost from token math and compute reality, not optimism.
If the compute target is modal, include infra overhead and retries in the budget.
If the compute target is docker, include image build and environment prep costs when they are real.
If the compute target is local, make sure the hardware and ambient environment assumptions are actually credible.
Use getHypothesisSpend(cwd, id) for resumed ideas and getAllHypothesisSpends(cwd) when several open ideas are already burning money.
Reject caps that cannot fund the declared n under the chosen compute target.
Reject caps like $0, uncapped, or whatever it takes.
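
The judge leaves and the compute answer can both be normalized into stable strings before they reach the registry. This is a sketch under assumptions: the `key=value` comma-separated layout is inferred from the example `judgeRef` in this document, not from a documented format, and `JudgeLeaves`, `buildJudgeRef`, and `parseComputeTarget` are hypothetical helpers.

```typescript
// The four exact leaves this step requires for the judge.
interface JudgeLeaves {
  model: string;
  prompt: string; // a pinned path or revision, not a scratch prompt
  temperature: number;
  seed: number;
}

// Placeholders that drift over time and must never be persisted.
const DRIFTING_VALUES = ["latest", "default", "current prompt"];

// Compresses the leaves into one string; layout inferred from the document's example.
function buildJudgeRef(leaves: JudgeLeaves): string {
  for (const value of [leaves.model, leaves.prompt]) {
    const lower = value.trim().toLowerCase();
    if (lower === "" || DRIFTING_VALUES.indexOf(lower) !== -1) {
      throw new Error(`drifting or empty judge leaf: "${value}"`);
    }
  }
  return (
    `model=${leaves.model},prompt=${leaves.prompt},` +
    `temp=${leaves.temperature},seed=${leaves.seed}`
  );
}

// Maps the researcher's answer onto one of the three registered targets.
function parseComputeTarget(answer: string): "local" | "docker" | "modal" {
  switch (answer.trim().toLowerCase()) {
    case "local":
      return "local";
    case "docker":
      return "docker";
    case "modal":
      return "modal";
    default:
      throw new Error(`"${answer}" is not a registered compute target`);
  }
}
```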

8. Set `bestCaseConclusion` before results can seduce you

Ask: "If everything goes right, what is the strongest conclusion you would allow yourself to write?"
Force one sentence.
Keep it benchmark-bound, judge-bound, baseline-bound, and smaller than the story in anyone's head.
This is low-expectations framing, not a victory speech.
Reject sweeping answers like "this changes everything" or "this proves general reasoning."
Prefer modest conclusions such as: "Under the locked judge on benchmark B, method X appears better than the named baseline."
Check coherence across all eight research fields.
The claim and falsifier must talk about the same metric.
n must be large enough for the falsifier.
The budget must actually fund n under the chosen compute target.
bestCaseConclusion must be narrower than the claim surface, not broader.
If any pair conflicts, ask one more question and fix the conflict before writing the draft.
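
Two of the coherence checks above are mechanical enough to sketch: whether the claim and falsifier name the same metric, and whether the cap funds `n`. The cost model here (per-run cost times `n` plus a fixed overhead) is an illustrative assumption, as are the names `CoherenceInput` and `coherenceProblems`; real estimates should come from token math and the chosen compute target.

```typescript
// Hypothetical inputs for a coherence pass over the draft.
interface CoherenceInput {
  claim: string;
  falsifier: string;
  metric: string; // the metric both sentences must name
  n: number;
  costPerRunUsd: number; // assumed flat per-run cost
  overheadUsd: number; // image builds, infra, retries
  costCapUsd: number;
}

// Returns conflicts between fields; each one deserves one more question.
function coherenceProblems(input: CoherenceInput): string[] {
  const problems: string[] = [];
  const metric = input.metric.toLowerCase();
  const claimHasMetric = input.claim.toLowerCase().indexOf(metric) !== -1;
  const falsifierHasMetric = input.falsifier.toLowerCase().indexOf(metric) !== -1;
  if (!claimHasMetric || !falsifierHasMetric) {
    problems.push("claim and falsifier do not both name the metric");
  }
  const estimate = input.n * input.costPerRunUsd + input.overheadUsd;
  if (input.costCapUsd <= 0) {
    problems.push("cost cap must be a positive USD amount");
  } else if (estimate > input.costCapUsd) {
    problems.push(
      `estimated $${estimate.toFixed(2)} exceeds cap $${input.costCapUsd.toFixed(2)}`
    );
  }
  return problems;
}
```

When the estimate exceeds the cap, narrow the claim or lower `n` with a matching falsifier. Do not raise the cap after the fact.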

9. Write the draft to `HYPOTHESES.md` and stop there

Create or confirm the hypothesis id.
Add housekeeping fields: status: "OPEN" and timestamp: Date.now().
Load entries with loadHypotheses(cwd).
Update the target entry in memory or append a new one.
Persist the real repo fields: id, claim, falsifier, bestCaseConclusion, n, judgeRef, baselineRef, costCap, computeTarget, status, and timestamp.
Use saveHypotheses(cwd, entries) so the file stays parseable by parseHypotheses(content).
Write every non-chosen explanation to experiments/{id}/alternatives/.
Do not invent a parallel markdown format.
Do not write experiments/{id}/prereg.md yet.
Do not write experiments/{id}/judge.lock yet.
Do not run code yet.
Exit condition: the idea now exists as an OPEN draft in HYPOTHESES.md, the rival explanations are archived, and every research field is specific enough to survive preregistration.

Common Rationalizations

Excuse	Reality
"The first explanation is probably the right one."	Probably is exactly why you generate rivals first.
"We already know what caused the observation."	Then it should survive distinct competing explanations and distinct killer predictions.
"Two alternatives can share the same disconfirming test."	Then they are not separated explanations yet. Tighten or delete one.
"I'll just ask for all the fields at once."	Then you will get polished vagueness all at once.
"The claim is directionally clear."	Direction is not a metric, comparator, or falsifier.
"We can define the falsifier later."	Then the claim will drift to protect itself.
"The alternatives don't need to be saved."	Unwritten rejected explanations come back later as convenient excuses.
"The baseline is obvious."	If it is not named and sourced, it is folklore.
"`n` depends on how the first runs look."	That is optional stopping in a lab coat.
"`bestCaseConclusion` is just wording."	It is the ceiling that prevents post-hoc marketing.
"`computeTarget` is implied."	Hidden infrastructure assumptions are still assumptions. Record them.
"We'll decide local vs Docker vs Modal after a smoke run."	Environment choice changes cost, runtime, and reproducibility. Decide before evidence exists.
"Docker is overkill for a small experiment."	If reproducibility matters, the overhead is the point.
"Modal is just more cores."	Remote infrastructure changes billing, secrets, startup, and failure modes.
"The registry can stay compact; we'll remember the rest."	If it is not in the record, it will drift.
"The gates will catch mistakes later."	Gates do not rescue vague thinking.

Red Flags - STOP

You only have one favored explanation for the observation.
Two competing explanations share the same disconfirming prediction.
You are about to ask multiple unresolved questions in one message.
The claim still uses words like better, smarter, or more robust without a metric.
The falsifier cannot answer "What would disprove this?" in one empirical sentence.
n is still TBD, until stable, or whatever fits.
The judge config still contains latest, default, or a mutable scratch prompt.
The baseline reference is a brand name with no source URL or version.
getBaselineAgeDays(entry) shows the only local comparator is stale and you are ignoring it.
computeTarget is blank, implied, or later.
The budget only works on local but the planned run is actually modal.
The non-chosen explanations exist only in chat and not in experiments/{id}/alternatives/.
bestCaseConclusion is broader than the claim.
getActiveHypothesis(entries) points at a different live idea and you are still about to overwrite it.
You feel pressure to run code before the rivals are separated and archived.

Good vs Bad

Good: competing hypotheses from one observation

Observation:
Prompt B beat Prompt A on GSM8K in three pilot traces.

| Explanation | Unique disconfirming prediction | Falsifiability | Cost | Prior |
| --- | --- | --- | --- | --- |
| Reasoning scaffold helps multi-step arithmetic | Gain persists under exact-match and concentrates on multi-step items | 5/5 | $0.60 | 4/5 |
| Judge prefers verbose answers | Gain disappears under exact-match or a terse-answer rubric | 5/5 | $0.40 | 3/5 |
| Prompt B leaks answer templates | Gain disappears when demonstrations are length-matched but content-swapped | 4/5 | $0.80 | 2/5 |

Why it is good:

one observation generated multiple live explanations
each explanation has a distinct killer prediction
the ranking is explicit
the researcher can choose one without pretending the others never existed

Bad

Observation:
Prompt B looks smarter.

Explanation 1:
Prompt B is better.

Explanation 2:
Maybe the judge liked it.

Explanation 3:
Maybe randomness.

Why it is bad:

the observation is already a conclusion
none of the explanations is precise
none has a unique disconfirming prediction
there is no ranking and nothing to archive honestly

Good: compute target question

Q: Where will experiments run? local, Docker, or Modal?
A: Docker. The run is CPU-only, dependency-sensitive, and another researcher must be able to rerun it unchanged.

Why it is good:

the question is explicit
the answer is tied to actual constraints
computeTarget now has real methodological meaning

Bad

We'll start local and move it wherever if the numbers look good.

Why it is bad:

the environment can now drift after peeking
budget and runtime assumptions are fake
wherever is not a registered compute target

Good: draft write to `HYPOTHESES.md`

```typescript
// Assumes `cwd` is the repo root and that loadHypotheses and saveHypotheses
// are already imported from the repo's state helpers.
const entries = await loadHypotheses(cwd);

// Append a new OPEN draft carrying all eight research fields plus housekeeping.
entries.push({
  id: "prompt-b-vs-prompt-a-gsm8k-2026-05-31",
  claim: "Prompt B improves exact-match over Prompt A on GSM8K under the named judge.",
  falsifier: "If mean exact-match improvement is less than 2 points across n=30 runs, the claim is falsified.",
  bestCaseConclusion: "Under the locked judge on GSM8K, Prompt B may outperform Prompt A by a modest margin.",
  n: 30,
  judgeRef: "model=gpt-5.4-mini,prompt=prompts/gsm8k-judge-v3.md@9f3e2c1,temp=0,seed=17",
  baselineRef: "prompt-a|url=https://example.com/prompts/a|version=2026-05-31|retrieved=2026-05-31",
  costCap: 18,
  computeTarget: "docker",
  status: "OPEN",
  timestamp: Date.now(),
});

// Persist through the helper so the file stays parseable.
await saveHypotheses(cwd, entries);
```

Why it is good:

it uses the real state helpers
it persists the actual repo fields
it includes bestCaseConclusion and computeTarget
it leaves a clean OPEN draft for preregistration

Bad

```
## Hypothesis: big-win
- **Claim:** Prompting is better
- **Falsifier:** If the vibe is off
- **N:** We'll see
- **Judge:** latest
- **Baseline:** SOTA
- **Cost cap:** whatever it takes
- **Compute target:** later
- **Best-case conclusion:** This changes everything
```

Why it is bad:

every critical field is vague, drifting, or unserious
there are no rival explanations
the environment is undecided
the conclusion ceiling is marketing, not science

Good: archive the losers

```
# Alternative 02: judge-format-bias

- Observation: Prompt B beat Prompt A on GSM8K in three pilot traces.
- Explanation: The judge prefers verbose step-by-step answers rather than better arithmetic.
- Unique disconfirming prediction: The win disappears under exact-match or a terse-answer rubric.
- Falsifiability score: 5/5
- Cost to test: $0.40
- Prior plausibility: 3/5
- Not chosen now because: The researcher selected the reasoning-scaffold explanation as the primary claim.
```

Why it is good:

the rejected explanation is preserved
the disconfirming prediction stays attached to it
later review can see what was rejected and why

Bad

We considered some other ideas but they were weaker.

Why it is bad:

nothing is auditable
weaker by what standard is unknown
the discarded explanations can now be reinvented whenever convenient

Why This Matters

Most bad research does not begin with fabricated numbers. It begins when the first plausible story becomes the only story, and nobody records what else could have explained the same observation.

Competing hypotheses make the question honest. bestCaseConclusion keeps the future writeup small enough to deserve trust. computeTarget stops environment drift from masquerading as methodological detail.

This stage is cheap. That is why people try to skip it.

After the draft exists in HYPOTHESES.md, the non-chosen explanations are written to experiments/{id}/alternatives/, and the prereg-ready fields are explicit, use /skill:preregistration.

Output: Writing RESEARCH.md

When all slots are filled and confirmed by the user:

Generate the complete Research Document using the template from docs/research-document.md
Write it to RESEARCH.md in the repo root (overwrite if it already exists)
For each Research Story in section 10 (RS-001, RS-002, …), append an entry to HYPOTHESES.md:

Before appending: if HYPOTHESES.md does not yet exist, create it with the following header (including the trailing blank line):
```
# Hypotheses

Registered hypotheses for this research project.
```
Then append each entry. Get the current timestamp by running:
```
node -e "console.log(Date.now())"
```
This produces a 13-digit Unix millisecond integer (e.g. 1748995200000). Use that value for the Timestamp field.

Set Compute target to local, docker, or modal based on the compute requirements described in the Research Story.

Each entry must use this exact format (field names match hypothesisToMarkdown exactly):
```
## Hypothesis: RS-NNN
- **Status:** OPEN
- **Claim:** <story title as a testable claim>
- **Falsifier:** <the observation that would falsify this claim>
- **Best case conclusion:** <one sentence on what a positive result would let you conclude>
- **N:** 30
- **Judge:** <judge model or script ref, e.g. gpt-4o or scripts/judge.py>
- **Baseline:** <baseline description, e.g. zero-shot GPT-4o>
- **Cost cap:** 50
- **Compute target:** local
- **Timestamp:** 1748995200000
```
Notify the user: "Research Document written to RESEARCH.md. Open the graph to see your hypotheses — the graph auto-refreshes within 2 seconds."

The graph panel reads RESEARCH.md and HYPOTHESES.md and will display the Research Document as the root node with Research Stories as circles.
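
For illustration, the entry format above can be produced by a small renderer. This is a local stand-in, not the repo's `hypothesisToMarkdown`: the `StoryEntry` shape and `renderStoryEntry` name are hypothetical, and the field labels are copied from the template in this section.

```typescript
// Hypothetical shape for one Research Story entry.
interface StoryEntry {
  id: string; // e.g. "RS-001"
  claim: string;
  falsifier: string;
  bestCaseConclusion: string;
  n: number;
  judge: string;
  baseline: string;
  costCap: number;
  computeTarget: "local" | "docker" | "modal";
  timestamp: number; // Unix milliseconds, as returned by Date.now()
}

// Renders one entry using the labels from the template in this section.
function renderStoryEntry(entry: StoryEntry): string {
  // Guard against second-resolution timestamps slipping in.
  if (String(entry.timestamp).length !== 13) {
    throw new Error("timestamp must be a 13-digit Unix millisecond integer");
  }
  return [
    `## Hypothesis: ${entry.id}`,
    "- **Status:** OPEN",
    `- **Claim:** ${entry.claim}`,
    `- **Falsifier:** ${entry.falsifier}`,
    `- **Best case conclusion:** ${entry.bestCaseConclusion}`,
    `- **N:** ${entry.n}`,
    `- **Judge:** ${entry.judge}`,
    `- **Baseline:** ${entry.baseline}`,
    `- **Cost cap:** ${entry.costCap}`,
    `- **Compute target:** ${entry.computeTarget}`,
    `- **Timestamp:** ${entry.timestamp}`,
  ].join("\n");
}
```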
