From research-helper
Write a structured, falsifiable experiment-spec markdown document for a CS/ML/NLP research question. Use when the user wants to design an experiment, write a study plan, plan ablations, or design a model comparison before running any code. Output: research/experiments/<slug>-YYYY-MM-DD.md. For literature searches use lit-scan; for paper digests use lit-digest; for analyzing experimental results use the future result-analyze.
How this skill is triggered — by the user, by Claude, or both
Slash command
/research-helper:experiment-designThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Given a research question, produce a falsifiable, sectioned experiment-spec
Given a research question, produce a falsifiable, sectioned experiment-spec
the user can review and pre-register before running any code. The doc covers
hypothesis, variables, baselines, metrics, statistical test, multi-seed plan,
expected outcomes, and risks. One of four experiment types: comparison
(default), ablation, scaling, prompt-eval.
Triggers:
/research-helper:experiment-design <question>Do NOT invoke for:
lit-scanlit-digestresult-analyzeThe question gets baked into the slug and filename. Rephrasing later is fiddly. Read back the user's question and confirm before calling the script.
If the question is vague ("study LLMs"), push the user to narrow it before calling the script. A good experiment question:
--type| If the user says | Use --type |
|---|---|
| "compare A vs B", "model X against Y" | comparison (default) |
| "ablation", "remove component X", "without Y" | ablation |
| "scaling", "vary the data size / parameter count", "how does N affect" | scaling |
| "prompt", "in-context", "few-shot strategy", "decoding params" | prompt-eval |
| Anything else | comparison |
If unclear, ask the user before calling the script.
python skills/experiment-design/scripts/design.py \
--question "<the question>" \
--type <type>
Output: research/experiments/<slug>-<YYYY-MM-DD>.md.
Knobs you might use:
--slug name: override the auto-derived slug (useful for short, memorable filenames)--output path: override the default location entirely--force: overwrite an existing file (the default is to refuse)Open the file. Go section by section. Use Edit to fill each block in place. Do not dump the whole filled doc into chat — the user opens the file alongside and follows along.
Pacing: one section per back-and-forth. Don't rush past the hypothesis to get to the variables — the hypothesis is where most experiments fail.
Before saying the spec is finished, verify:
Hypothesis is falsifiable. Reject vague statements like "X helps" or "Y improves performance." Require:
Both baselines present. Trivial AND strong are required, not optional. Push back if the user skips the trivial baseline — "we'd never lose to chance" isn't an argument, it's an assumption worth checking.
Exactly one primary metric. Pre-registered. Multiple primaries is a red flag for p-hacking. Secondaries can be there, but mark them clearly secondary.
Seeds ≥ 3. Default suggestion: 5. Warn but don't block if the user insists on fewer (e.g., for very expensive experiments). For very cheap experiments, push for 10+.
Expected-null outcome is concrete. Not just "we'd see no effect" — what's the noise floor? What range of point estimates would be a null?
Risks section has at least 3 entries. The empty bullets need to become specific threats: data leakage, test-set peeking, distribution shift, overfit-to-validation, confounded baseline, hidden hyperparameter tuning, etc.
When done, say: "Spec at <path>. Hypothesis: X. Primary metric: Y. Seeds: N.
Ready to run." Don't dump the full filled doc.
Output file already exists — script refuses to overwrite. Default behavior;
pass --slug new-slug or --force to override. Use --slug when iterating
on a different question; use --force only when intentionally rewriting.
--question produces an empty slug — happens for all-punctuation input. The
script exits 1 with --question produced an empty slug; pass --slug explicitly.
Get a real question from the user and retry.
User wants to skip the discipline gates ("just give me the doc") — write the doc, but flag the gaps explicitly in your handoff message. The user deserves to know what's incomplete.
No env vars. No dependencies beyond Python stdlib. Tests run offline.
| Knob | Default | When to change |
|---|---|---|
--type | comparison | Use ablation / scaling / prompt-eval when the question fits one of those shapes |
--output | research/experiments/<slug>-<date>.md | Override only if the user asks |
--slug | auto-derived from question | Override for shorter / more memorable names |
--force | off | Turn on only when intentionally rewriting an existing spec |
npx claudepluginhub mhburg/research-helper --plugin research-helperSearches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.