Turn vague research ideas, math-heavy claims, algorithmic bets, optimization loops, evaluation plans, and autonomous research workflows into falsifiable proof programs. Use when the user asks to prove, disprove, pressure-test, peer review, backtest or refine this skill, design a sandboxed research loop, choose evaluators, create research TDD scenarios, avoid hand-wavy iteration, or decide where a research roadmap should go next.
Slash command: /research-proof-plugin:research-proof
Force research work to behave like a proof program: define the object, state the claim, freeze the verifier, search for counterexamples, and update the ledger when evidence changes.
evals/evals.json
references/backtest-cases.md
references/behavioral-run-protocol.md
references/proof-ledger-template.md
references/proof-methods.md
references/research-claim-template.md
references/research-methods.md
references/sandbox-scenarios.json
scripts/backtest_research_skill.py
scripts/prepare_behavioral_workspace.py
Use this before implementation when possible. If code or experiments already exist, backfill the claim, verifier boundary, proof ladder, and ledger from actual evidence before adding more work.
Do not start from a favorite mechanism. Start from a falsifiable proposition.
Bad:
Can this approach work?
Good:
For domain D and baseline B, candidate family C wins only if metric M improves by delta, guardrails G stay within budget, hidden costs H are charged, and the result survives transfer test T.
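A minimal sketch of how the "Good" proposition could be encoded as an executable win condition. The `Claim` class, the `candidate_wins` function, and all field names are hypothetical. The sketch assumes that higher metric values are better and that hidden costs H are expressed in the same units as metric M.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical container for one falsifiable proposition."""
    domain: str               # domain D
    baseline_metric: float    # metric M measured on baseline B
    min_delta: float          # required improvement (delta)
    guardrail_budgets: dict   # guardrail name -> maximum allowed value


def candidate_wins(claim, candidate_metric, guardrails, hidden_cost, transfer_passed):
    """Return True only if every clause of the win condition holds.

    Assumption: higher metric is better, and hidden costs are charged
    against the candidate's metric before comparing to the baseline.
    """
    net_gain = candidate_metric - hidden_cost - claim.baseline_metric
    # A guardrail that was never measured counts as a violation.
    within_budget = all(
        guardrails.get(name, float("inf")) <= budget
        for name, budget in claim.guardrail_budgets.items()
    )
    return net_gain >= claim.min_delta and within_budget and transfer_passed
```

Under these assumptions, a candidate that beats the baseline on the raw metric still loses once hidden costs erase the margin, a guardrail is unmeasured, or the transfer test fails.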
Every output must label evidence as one of:
PROVEN: follows from a proof, verified derivation, executable checker, or primary-source fact with explicit scope. Do not mark a research claim PROVEN from authority, plausibility, benchmark score, or secondary interpretation alone.
SUPPORTED: current evidence points this way, but scope is limited.
REJECTED: failed an explicit gate.
OPEN: not tested or not formalized enough.
Before implementation, require Claim, Verifier Boundary, Baseline / Candidate Family, Enemy Terms, and Rejection Gates. If any are missing, produce those first and stop implementation planning.
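The four evidence labels and the pre-implementation gate can be sketched as follows. The `Evidence` enum, `REQUIRED_SECTIONS`, and both helper functions are hypothetical names, and representing a plan as a dict keyed by section title is an assumption made for illustration.

```python
from enum import Enum


class Evidence(Enum):
    """The four evidence labels named above."""
    PROVEN = "PROVEN"
    SUPPORTED = "SUPPORTED"
    REJECTED = "REJECTED"
    OPEN = "OPEN"


# Sections that must exist before implementation planning may start.
REQUIRED_SECTIONS = (
    "Claim",
    "Verifier Boundary",
    "Baseline / Candidate Family",
    "Enemy Terms",
    "Rejection Gates",
)


def missing_sections(plan):
    """List required sections that are absent or blank in a plan dict."""
    return [
        name for name in REQUIRED_SECTIONS
        if not str(plan.get(name) or "").strip()
    ]


def may_plan_implementation(plan):
    """Implementation planning is allowed only when nothing is missing."""
    return not missing_sections(plan)
```

When `missing_sections` returns a non-empty list, those sections are the next deliverable, not implementation tasks.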
Choose the method by verifier strength. If the verifier is weak, do not pretend the loop is autonomous.
| Situation | Method | Use when | Main failure mode |
|---|---|---|---|
| One mutable artifact, one frozen metric | Fixed-harness research loop | A candidate can be changed mechanically and scored cheaply | Metric hacking or local overfit |
| Mathematical or algorithmic claim | Proof ladder | A claim can move from examples to lemmas to proof | Plausible prose mistaken for proof |
| Many cheap ideas, uncertain prior | Divergent researcher pool | Independent starts can explore different hypotheses | Collapse to same idea or reward hacking |
| Program-search discovery | Evaluator-gated evolution | Candidate programs can be executed and scored | Evaluator accepts clever invalid shortcuts |
| Tool-using research agent | Observable agent loop | Each step can inspect reality before choosing the next action | Infinite loop, stale state, or unverifiable progress |
| Literature-heavy research | Evidence synthesis | External sources determine priors or baselines | Cherry-picked or stale claims |
| Shipping research | Transfer gate | A fixture win must survive real constraints | Benchmark win fails outside the sandbox |
For method details, read references/research-methods.md. For mathematical proof patterns, read references/proof-methods.md.
Claim: State the proposition with variables, domain, baseline, candidate family, metric, guardrails, hidden costs, and win condition.
Verifier Boundary: Name what is frozen, what is mutable, what the candidate cannot inspect, who reviews outputs, and what counts as tampering.
Baseline / Candidate Family: Define the current best alternative and the exact space of candidates allowed.
Enemy Terms: Charge every term that can erase the win: data leakage, evaluator hacking, extra supervision, human review burden, runtime, memory, complexity, operational cost, distribution shift, numerical instability, dependency changes, or cherry-picked evidence.
Proof Ladder: Move from examples to counterexamples, invariants, lemmas, derivation, executable check, formal proof, and transfer. Mark the highest level reached.
Candidate Construction: Specify the strongest construction or fixture family. Include what would count as a counterexample to the current thesis.
Verification Plan: Define commands, metrics, guardrails, PASS/REJECT gates, artifacts, and audit logs before implementation. If tests cannot run, say exactly what evidence is missing.
Proof Ledger: Record what changed in understanding, what got rejected, what remains open, and whether to CONTINUE, REFINE, PIVOT, or REJECT.
Next Pressure: Choose the next adversarial test. Avoid "make it better" as a next step; name the failure mode.
For agentic research, require the loop to be inspectable:
goal -> compact state -> next action -> observation -> verifier -> ledger update -> stop or continue
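One way to make the loop above inspectable is to record every step in a ledger and to bound the number of steps. This is only a sketch: `run_loop` and its three callables are hypothetical hooks, and the `max_steps` budget is an assumption added to guard against the infinite-loop failure mode listed in the method table.

```python
def run_loop(goal, state, choose_action, observe, verify, max_steps=10):
    """Run goal -> state -> action -> observation -> verifier -> ledger.

    Hypothetical hooks supplied by the caller:
      choose_action(goal, state) -> action
      observe(action)            -> observation
      verify(goal, observation)  -> (done: bool, note: str)
    """
    ledger = []
    for step in range(1, max_steps + 1):
        action = choose_action(goal, state)
        observation = observe(action)
        done, note = verify(goal, observation)
        # Every step is logged, so progress can be audited afterwards.
        ledger.append({"step": step, "action": action,
                       "observation": observation, "note": note})
        if done:
            return "DONE", ledger
        # Keep state compact: carry only what the next action needs.
        state = {"last_observation": observation, "steps": step}
    return "BUDGET_EXHAUSTED", ledger
```

The loop returns both a status and the full ledger, so a reviewer can see why it stopped without rerunning it.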
Rules:
done before the loop starts.
For normal research reviews, return:
Claim
Verifier Boundary
Baseline / Candidate Family
Current Evidence
Enemy Terms
Rejection Gates
Proof Ladder / Transfer Path
Verdict
Proof Ledger Decision
Next Pressure
For substantial research work, create or update artifacts using the templates in references/:
research-claim-template.md
proof-ledger-template.md
For behavioral backtests of this skill, read references/behavioral-run-protocol.md before running outputs, grading, aggregation, or review.
When designing or refining research workflows, use a sandbox with frozen tests before experiments.
Minimum sandbox:
scenario -> candidate action -> frozen evaluator -> score + guardrails -> ledger decision -> patch or reject
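The minimum sandbox above might be sketched like this. `run_scenario`, its parameters, and the "PATCH" / "REJECT" decision strings are hypothetical. The read-only view is shallow, so nested mutable values inside the scenario would still need their own protection.

```python
from types import MappingProxyType


def run_scenario(scenario, candidate_action, evaluator, min_score, guardrail_limits):
    """scenario -> candidate action -> frozen evaluator -> score + guardrails -> decision.

    Hypothetical hooks supplied by the caller:
      candidate_action(scenario)  -> output
      evaluator(scenario, output) -> (score, {guardrail name: measured value})
    """
    # Read-only view: the candidate cannot rewrite the scenario it is scored on.
    frozen = MappingProxyType(dict(scenario))
    output = candidate_action(frozen)
    score, guardrails = evaluator(frozen, output)
    # An unmeasured guardrail is treated as a violation.
    violations = [
        name for name, limit in guardrail_limits.items()
        if guardrails.get(name, float("inf")) > limit
    ]
    decision = "PATCH" if score >= min_score and not violations else "REJECT"
    return {"score": score, "violations": violations, "decision": decision}
```

A candidate that tries to assign into the scenario raises `TypeError`, which surfaces tampering as an error rather than as a silently inflated score.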
Rules:
Run the deterministic smoke test when available:
python scripts/backtest_research_skill.py
When the user asks to backtest or refine this skill, evaluate the skill itself. Do not solve the research cases as the main task.
Use references/backtest-cases.md and references/sandbox-scenarios.json as the test suite. For each case:
PASS, WEAK, or FAIL.
Return:
Backtest Matrix
Failure Patterns
Refinement Patch
Validation Plan
Reject the backtest as incomplete if it only says the skill is good, generic, or promising without naming failure patterns. The purpose is to improve the skill, not admire it.
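The completeness rule can be expressed as a small check. This is a sketch under assumptions: `backtest_is_complete` is a hypothetical helper, the backtest matrix is assumed to be a dict mapping case name to grade, and failure patterns are assumed to be a list of strings.

```python
GRADES = {"PASS", "WEAK", "FAIL"}


def backtest_is_complete(matrix, failure_patterns):
    """Return True only if every case has a valid grade and at least one
    failure pattern is named.

    matrix:           case name -> "PASS" | "WEAK" | "FAIL"
    failure_patterns: list of short pattern names (blank entries are ignored)
    """
    if not matrix or any(grade not in GRADES for grade in matrix.values()):
        return False
    named = [p for p in failure_patterns if p.strip()]
    return bool(named)
```

Note that a matrix of all PASS grades with no named failure pattern is still treated as incomplete here, which matches the rule that praise alone is not a backtest result.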
OPEN.
Do not name specific labs or projects in final artifacts unless the user asks for source attribution. Encode the patterns:
For a substantial research change, add these artifacts under the active change:
openspec/changes/<change>/research-claim.md
openspec/changes/<change>/proof-ledger.md
If the codebase already has tasks.md, add a task for the next pressure test only after the claim and rejection gate are explicit.
When updating SDD artifacts, keep this order:
research-claim -> proof-ledger -> design/tasks -> implementation
If implementation has already happened, backfill the claim and ledger from actual evidence before adding more code.
npx claudepluginhub tonyblu331/research-proof --plugin research-proof-plugin