From clawbio
Iteratively rewrites a SKILL.md so a downstream agent performs better against an LLM-judge rubric. Designed for eval-driven skill tuning with reproducible scoring.
Slash command: /clawbio:clawpathy_autoresearch
Files in this skill:
__init__.py
__main__.py
dispatcher.py
examples/champions/README.md
examples/champions/trubetskoy_scz_finemap_0.235.md
examples/champions/yengo_height_ldsc_h2.md
examples/demo_task/README.md
examples/demo_task/rubric.md
examples/demo_task/task.json
examples/presentation_intro.md
examples/skill_transfer_summary.png
executor.py
judge.py
loop.py
loop_parallel.py
multitask_loop.py
prompts/builder.md
prompts/executor.md
prompts/judge.md
prompts/proposer.md

Eval-driven skill development. The system iteratively rewrites a SKILL.md
so a downstream executor agent performs better at a task class, as judged
by an LLM against a paper/task-specific rubric.
propose (sonnet) → execute (sonnet, shell) → judge (opus, rubric)
    ↑                                                       │
    └──────── feedback: verdict + recommended edits ────────┘
early_stop_n consecutive regressions.

You (the agent reading this) don't run the loop yourself. You dispatch subagents to build the workspace, then hand off to the Python loop.
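The propose → execute → judge cycle in the diagram above can be sketched in Python as follows. This is a minimal illustration, not the code in loop.py: the function name run_loop, the IterationResult type, the max_iters parameter, the baseline run on the seed skill, and the "keep only when the score strictly improves, higher is better" rule are all assumptions. Only the three roles and the early_stop_n knob come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class IterationResult:
    iteration: int
    score: float
    kept: bool


def run_loop(
    seed_skill: str,
    propose: Callable[[str, str], str],         # (best skill, feedback) -> candidate skill
    execute: Callable[[str], str],              # candidate skill -> executor transcript
    judge: Callable[[str], Tuple[float, str]],  # transcript -> (score, feedback)
    max_iters: int = 10,                        # hypothetical knob
    early_stop_n: int = 3,
) -> Tuple[str, List[IterationResult]]:
    """Hypothetical keep-best loop with early stopping on consecutive regressions."""
    best_skill = seed_skill
    # Assumption: the seed skill is scored once to get a baseline.
    best_score, feedback = judge(execute(seed_skill))
    history: List[IterationResult] = []
    regressions = 0

    for i in range(1, max_iters + 1):
        candidate = propose(best_skill, feedback)
        score, feedback = judge(execute(candidate))
        kept = score > best_score  # assumption: higher is better
        if kept:
            best_skill, best_score = candidate, score
            regressions = 0
        else:
            regressions += 1
        history.append(IterationResult(i, score, kept))
        if regressions >= early_stop_n:
            break  # stop after early_stop_n consecutive regressions

    return best_skill, history
```

In this sketch the three callables stand in for the subagent calls, so the control flow can be exercised with plain stub functions before any model is involved.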
Dispatch a subagent with prompts/scout.md to research the paper/task.
Report key findings to the user in a few lines.
Have a conversation. Ask ONE question at a time, multiple-choice where helpful. Agree on:
Present a summary and get approval.
Dispatch a builder subagent with prompts/builder.md and the agreed
scope. It writes:
task.json
rubric.md: the authoritative scoring rubric for the LLM judge
reference/ (optional; judge-only)
skill/SKILL.md: the seed skill

Validate:
from pathlib import Path

from skills.clawpathy_autoresearch import validate_workspace

# Replace "WORKSPACE" with the path to the workspace directory.
print(validate_workspace(Path("WORKSPACE")))  # [] means valid
python -m skills.clawpathy_autoresearch WORKSPACE_DIR
# or with custom models:
python -m skills.clawpathy_autoresearch WORKSPACE_DIR \
--proposer-model sonnet --executor-model sonnet --judge-model opus
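If the loop is launched from another Python script rather than a shell, the same invocation can be assembled as an argument list. This is a small sketch: the helper name build_command is hypothetical, and it uses only the module path and the three model flags shown in the commands above.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_command(
    workspace: Path,
    proposer_model: Optional[str] = None,
    executor_model: Optional[str] = None,
    judge_model: Optional[str] = None,
) -> List[str]:
    """Build the argv list for `python -m skills.clawpathy_autoresearch` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "skills.clawpathy_autoresearch", str(workspace)]
    flags = (
        ("--proposer-model", proposer_model),
        ("--executor-model", executor_model),
        ("--judge-model", judge_model),
    )
    for flag, value in flags:
        if value is not None:  # omit a flag to leave that model at its default
            cmd += [flag, value]
    return cmd


# Usage (not executed here):
#   import subprocess
#   subprocess.run(build_command(Path("WORKSPACE_DIR"), judge_model="opus"), check=True)
```

Passing the command as a list avoids shell quoting problems when the workspace path contains spaces.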
The loop streams progress to WORKSPACE/history.jsonl, snapshots every
iteration's skill to WORKSPACE/snapshots/iter-NNN.md, and writes the
executor's full transcript to WORKSPACE/executor_runs/iter-NNN.log.
workspace/
  task.json                    # task metadata + loop knobs
  rubric.md                    # LLM-judge rubric (the heart of the system)
  reference/                   # optional ground truth, judge-only
  skill/SKILL.md               # iterated by the loop
  output/                      # executor outputs (cleared each iter)
  executor_runs/iter-NNN.log   # transcripts (judge reads these)
  snapshots/iter-NNN.md        # per-iter SKILL.md snapshots
  history.jsonl                # one row per iter: score, kept, verdict
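Because history.jsonl holds one JSON row per iteration, a finished run can be inspected with a few lines of Python. The sketch below assumes the rows use the keys "score", "kept", and "verdict" (taken from the layout comment above; the exact key names are not confirmed) and that higher scores are better. The function names load_history and best_kept are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_history(path: Path) -> List[Dict]:
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    rows: List[Dict] = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def best_kept(rows: List[Dict]) -> Optional[Dict]:
    """Return the highest-scoring row that was kept, or None if no row was kept.

    Assumes each row has a numeric "score" and a boolean "kept" key.
    """
    kept = [row for row in rows if row.get("kept")]
    if not kept:
        return None
    return max(kept, key=lambda row: row["score"])


# Usage (not executed here):
#   rows = load_history(Path("WORKSPACE") / "history.jsonl")
#   print(best_kept(rows))
```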
judge.md + opus. This keeps the system low-code and lets the
rubric carry paper-specific nuance without adding code.

reference/ is judge-only. The executor prompt says not to read it, and the judge penalises leakage.

npx claudepluginhub clawbio/clawbio --plugin clawbio