experiment-design — write a structured experiment-spec doc

Write a structured, falsifiable experiment-spec markdown document for a CS/ML/NLP research question. Use when the user wants to design an experiment, write a study plan, plan ablations, or design a model comparison before running any code. Output: research/experiments/<slug>-YYYY-MM-DD.md. For literature searches use lit-scan; for paper digests use lit-digest; for analyzing experimental results use the future result-analyze.

Supporting Files

scripts/design.py
scripts/requirements.txt
scripts/test_design.py

SKILL.md

141 lines · ~1.5k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last commit: Jun 1, 2026


Purpose

Given a research question, produce a falsifiable, sectioned experiment-spec the user can review and pre-register before running any code. The doc covers hypothesis, variables, baselines, metrics, statistical test, multi-seed plan, expected outcomes, and risks. Each spec uses one of four experiment types: comparison (default), ablation, scaling, or prompt-eval.

When to invoke

Triggers:

"Design an experiment for X"
"What should I run to test Y?"
"Help me write a study plan / experiment spec"
"Plan the ablation for Z"
"Plan the scaling study for W"
"Compare prompting strategies for V"
User runs /research-helper:experiment-design <question>

Do NOT invoke for:

"What's the SOTA on X?" — that's lit-scan
"Summarize paper Y" — that's lit-digest
"Analyze these results" — that's the future result-analyze
Casual questions ("how should I think about ablations") — the user wants a doc, not a chat

Workflow

1. Confirm the research question

The question gets baked into the slug and filename. Rephrasing later is fiddly. Read back the user's question and confirm before calling the script.

If the question is vague ("study LLMs"), push the user to narrow it before calling the script. A good experiment question:

Names what's compared (or what's varied / what's removed)
Names where (the dataset / benchmark / task)
Implies what's measured

2. Pick the right `--type`

| If the user says | Use `--type` |
| --- | --- |
| "compare A vs B", "model X against Y" | `comparison` (default) |
| "ablation", "remove component X", "without Y" | `ablation` |
| "scaling", "vary the data size / parameter count", "how does N affect" | `scaling` |
| "prompt", "in-context", "few-shot strategy", "decoding params" | `prompt-eval` |
| Anything else | `comparison` |

If unclear, ask the user before calling the script.
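The table above can be read as a keyword lookup. The following is a minimal sketch of that reading, not the script's own logic (design.py takes `--type` as an explicit argument). The `suggest_type` helper and its keyword lists are hypothetical. It returns `None` when no row or more than one row matches, which is the cue to ask the user.

```python
from typing import Optional

# Keyword lists paraphrase the table above; they are illustrative assumptions,
# not something design.py is known to implement.
TYPE_KEYWORDS = [
    ("ablation", ("ablation", "ablate", "remove", "without")),
    ("scaling", ("scaling", "data size", "parameter count", "how does n affect")),
    ("prompt-eval", ("prompt", "in-context", "few-shot", "decoding")),
    ("comparison", ("compare", " vs ", "versus", "against")),
]


def suggest_type(request: str) -> Optional[str]:
    """Return the single matching type, or None when zero or several rows match."""
    text = f" {request.lower()} "
    matches = [
        name
        for name, words in TYPE_KEYWORDS
        if any(word in text for word in words)
    ]
    return matches[0] if len(matches) == 1 else None


print(suggest_type("Plan the ablation for the attention layer"))
print(suggest_type("Compare prompting strategies for V"))
```

The second call matches both the `comparison` and `prompt-eval` rows, so the sketch returns `None` rather than guessing. When nothing matches and the user has no preference, the table's last row (`comparison`) applies.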

3. Call the script

python skills/experiment-design/scripts/design.py \
  --question "<the question>" \
  --type <type>

Output: research/experiments/<slug>-<YYYY-MM-DD>.md.
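To make the output naming concrete, here is a rough sketch of how a slug and default path could be derived from the question. The actual rule in design.py is not shown in this document, so `slugify`, its `max_words` cap, and `default_output_path` are assumptions for illustration only.

```python
import re
from datetime import date
from pathlib import Path


def slugify(question: str, max_words: int = 8) -> str:
    """Lowercase the question, keep alphanumeric runs, join the first few with hyphens."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return "-".join(words[:max_words])


def default_output_path(question: str, today: date) -> Path:
    """Build research/experiments/<slug>-<YYYY-MM-DD>.md for a question."""
    slug = slugify(question)
    if not slug:
        # All-punctuation input leaves nothing to build a filename from.
        raise ValueError("--question produced an empty slug; pass --slug explicitly")
    return Path("research/experiments") / f"{slug}-{today.isoformat()}.md"


print(default_output_path("Does LoRA beat full fine-tuning?", date(2026, 6, 1)).as_posix())
```

Because the question feeds the filename, a reworded question yields a different file, which is why step 1 asks for confirmation first.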

Knobs you might use:

--slug name: override the auto-derived slug (useful for short, memorable filenames)
--output path: override the default location entirely
--force: overwrite an existing file (the default is to refuse)
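For orientation, the flags listed above could be declared with `argparse` roughly as follows. This is a sketch of the command-line surface described in this document, not a copy of design.py. Treating `--question` as required and the help strings are assumptions.

```python
import argparse

EXPERIMENT_TYPES = ("comparison", "ablation", "scaling", "prompt-eval")


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; defaults follow the defaults summary."""
    parser = argparse.ArgumentParser(prog="design.py")
    parser.add_argument("--question", required=True, help="the research question")
    parser.add_argument("--type", choices=EXPERIMENT_TYPES, default="comparison")
    parser.add_argument("--slug", default=None, help="override the auto-derived slug")
    parser.add_argument("--output", default=None, help="override the output path")
    parser.add_argument("--force", action="store_true", help="overwrite an existing file")
    return parser


args = build_parser().parse_args(
    ["--question", "Does A beat B on C?", "--type", "ablation"]
)
print(args.type, args.slug, args.force)
```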

4. Walk the user through filling each section

Open the file. Go section by section. Use Edit to fill each block in place. Do not dump the whole filled doc into chat — the user opens the file alongside and follows along.

Pacing: one section per back-and-forth. Don't rush past the hypothesis to get to the variables — the hypothesis is where most experiments fail.

5. Discipline gates (don't declare "ready to run" until these pass)

Before saying the spec is finished, verify:

Hypothesis is falsifiable. Reject vague statements like "X helps" or "Y improves performance." Require:

An effect direction stated explicitly ("X reduces perplexity by ≥ d", "Y improves accuracy on benchmark B by ≥ d")
The null hypothesis explicit and distinct from the alternative
A concrete d (effect size) — even an order-of-magnitude estimate

Both baselines present. Trivial AND strong are required, not optional. Push back if the user skips the trivial baseline — "we'd never lose to chance" isn't an argument, it's an assumption worth checking.

Exactly one primary metric. Pre-registered. Multiple primaries are a red flag for p-hacking. Secondary metrics are allowed, but mark them clearly as secondary.

Seeds ≥ 3. Default suggestion: 5. Warn but don't block if the user insists on fewer (e.g., for very expensive experiments). For very cheap experiments, push for 10+.

Expected-null outcome is concrete. Not just "we'd see no effect" — what's the noise floor? What range of point estimates would be a null?

Risks section has at least 3 entries. The empty bullets need to become specific threats: data leakage, test-set peeking, distribution shift, overfit-to-validation, confounded baseline, hidden hyperparameter tuning, etc.
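The gates above can be summarized as a checklist function. The sketch below assumes the filled spec has been read into a hypothetical `Spec` record. The field names are illustrative and do not reflect the template's real section names. The seed gate is the only one returned as a warning, matching the "warn but don't block" rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Spec:
    """Hypothetical in-memory view of a filled spec; field names are assumptions."""
    alternative: str = ""              # e.g. "X reduces perplexity by >= d"
    null: str = ""
    effect_size: Optional[float] = None
    trivial_baseline: str = ""
    strong_baseline: str = ""
    primary_metrics: List[str] = field(default_factory=list)
    seeds: int = 5
    expected_null: str = ""            # noise floor, range of null point estimates
    risks: List[str] = field(default_factory=list)


def check_gates(spec: Spec) -> Tuple[List[str], List[str]]:
    """Return (blocking failures, non-blocking warnings)."""
    failures: List[str] = []
    warnings: List[str] = []
    if not spec.alternative or not spec.null or spec.null == spec.alternative:
        failures.append("hypothesis: state a directional alternative and a distinct null")
    if spec.effect_size is None:
        failures.append("hypothesis: give a concrete effect size d")
    if not (spec.trivial_baseline and spec.strong_baseline):
        failures.append("baselines: both trivial and strong are required")
    if len(spec.primary_metrics) != 1:
        failures.append("metrics: exactly one primary metric")
    if spec.seeds < 3:
        warnings.append("seeds: fewer than 3")
    if not spec.expected_null:
        failures.append("expected null: describe the noise floor")
    if len(spec.risks) < 3:
        failures.append("risks: list at least 3 specific threats")
    return failures, warnings
```

A check like this only confirms that each field is filled in. Whether a hypothesis is truly falsifiable or a baseline is truly strong still needs judgment during the walkthrough.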

6. Tell the user where it lives

When done, say: "Spec at <path>. Hypothesis: X. Primary metric: Y. Seeds: N. Ready to run." Don't dump the full filled doc.
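As a small illustration, the handoff line can be assembled from the spec's key fields. The `handoff_message` helper is hypothetical, and the wording used when gaps remain is an assumption. The document only requires that gaps be flagged explicitly.

```python
from typing import Sequence


def handoff_message(
    path: str,
    hypothesis: str,
    metric: str,
    seeds: int,
    gaps: Sequence[str] = (),
) -> str:
    """Build the one-line handoff; list open gaps instead of claiming readiness."""
    summary = (
        f"Spec at {path}. Hypothesis: {hypothesis}. "
        f"Primary metric: {metric}. Seeds: {seeds}."
    )
    if gaps:
        return f"{summary} Not ready to run; open gaps: {'; '.join(gaps)}."
    return f"{summary} Ready to run."


print(
    handoff_message(
        "research/experiments/demo-2026-06-01.md",
        "X reduces perplexity by >= 0.5",
        "perplexity",
        5,
    )
)
```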

Failure modes

Output file already exists: the script refuses to overwrite by default. Pass --slug <new-slug> to write to a different file, or --force to overwrite. Use --slug when iterating on a different question; use --force only when intentionally rewriting.

--question produces an empty slug: this happens for all-punctuation input. The script exits with status 1 and the message "--question produced an empty slug; pass --slug explicitly". Get a real question from the user and retry.

User wants to skip the discipline gates ("just give me the doc") — write the doc, but flag the gaps explicitly in your handoff message. The user deserves to know what's incomplete.
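The refuse-to-overwrite behavior in the first failure mode can be sketched with Python's exclusive-creation file mode. This is one plausible way to get that behavior, not necessarily how design.py does it. `write_spec` is a hypothetical helper.

```python
from pathlib import Path


def write_spec(path: Path, text: str, force: bool = False) -> None:
    """Write the spec, refusing to clobber an existing file unless force is set."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file is already there;
    # mode "w" truncates and rewrites it.
    mode = "w" if force else "x"
    with open(path, mode, encoding="utf-8") as handle:
        handle.write(text)
```

Exclusive creation avoids a separate existence check followed by a write, so a file that appears between the two steps cannot be overwritten.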

Setup notes

No env vars. No dependencies beyond Python stdlib. Tests run offline.

Defaults summary

| Knob | Default | When to change |
| --- | --- | --- |
| `--type` | `comparison` | Use `ablation` / `scaling` / `prompt-eval` when the question fits one of those shapes |
| `--output` | `research/experiments/<slug>-<date>.md` | Override only if the user asks |
| `--slug` | auto-derived from question | Override for shorter / more memorable names |
| `--force` | off | Turn on only when intentionally rewriting an existing spec |
