Skill

clawpathy-autoresearch

Iteratively rewrites a SKILL.md so a downstream agent performs better against an LLM-judge rubric. Designed for eval-driven skill tuning with reproduceable scoring.

automation

developer-tools

Popularity

Stars

981

Forks

201

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/clawbio:clawpathy_autoresearch

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Eval-driven skill development. The system iteratively rewrites a `SKILL.md`

Supporting Files

SKILL.md

155 lines · ~1.4k tokens

Stats

LanguagePython

Stars981

Forks201

MaintenanceExcellent

Last CommitJun 18, 2026

Actions

View Source View Plugin View on GitHub View README

clawpathy-autoresearch

Eval-driven skill development. The system iteratively rewrites a SKILL.md so a downstream executor agent performs better at a task class, as judged by an LLM against a paper/task-specific rubric.

Core idea

  propose (sonnet)  →  execute (sonnet, shell)  →  judge (opus, rubric)
       ↑                                                       │
       └──────── feedback: verdict + recommended edits ────────┘

Proposer rewrites SKILL.md based on the last judge verdict.
Executor runs the new SKILL.md end-to-end inside a workspace.
Judge scores methodology (primary) and outputs (secondary) against a per-task rubric. Lower is better; 0 = perfect.
Keep the new SKILL.md only if it strictly beats the best score; else revert. Stop on target_score or on early_stop_n consecutive regressions.

You are the orchestrator

You (the agent reading this) don't run the loop yourself. You dispatch subagents to build the workspace, then hand off to the Python loop.

Phase 1 — Scout

Dispatch a subagent with prompts/scout.md to research the paper/task. Report key findings to the user in a few lines.

Phase 2 — Scope (you + user)

Have a conversation. Ask ONE question at a time, multiple-choice where helpful. Agree on:

what to reproduce / what success looks like
which data sources are in-bounds
what methodology expectations belong in the rubric
iteration budget and target_score (if any)

Present a summary and get approval.

Phase 3 — Build

Dispatch a builder subagent with prompts/builder.md and the agreed scope. It writes:

task.json
rubric.md — the authoritative scoring rubric for the LLM judge
reference/ (optional; judge-only)
skill/SKILL.md — seed

Validate:

from skills.clawpathy_autoresearch import validate_workspace
print(validate_workspace(Path("WORKSPACE")))  # [] means valid

Phase 4 — Loop

python -m skills.clawpathy_autoresearch WORKSPACE_DIR
# or with custom models:
python -m skills.clawpathy_autoresearch WORKSPACE_DIR \
  --proposer-model sonnet --executor-model sonnet --judge-model opus

The loop streams progress to WORKSPACE/history.jsonl, snapshots every iteration's skill to WORKSPACE/snapshots/iter-NNN.md, and writes the executor's full transcript to WORKSPACE/executor_runs/iter-NNN.log.

Workspace layout

workspace/
  task.json                  # task metadata + loop knobs
  rubric.md                  # LLM-judge rubric (the heart of the system)
  reference/                 # optional ground truth, judge-only
  skill/SKILL.md             # iterated by the loop
  output/                    # executor outputs (cleared each iter)
  executor_runs/iter-NNN.log # transcripts (judge reads these)
  snapshots/iter-NNN.md      # per-iter SKILL.md snapshots
  history.jsonl              # one row per iter: score, kept, verdict

Key principles

LLM judge only. No deterministic Python scorers. All evaluation goes through judge.md + opus. This keeps the system low-code and lets the rubric carry paper-specific nuance without adding code.
Methodology is primary. The rubric weights "did the agent use sound methods?" above "did the numbers match?". Ground-truth match is a signal, not the objective — the goal is better SKILL.md files.
Never leak ground truth. reference/ is judge-only. The executor prompt says not to read it, and the judge penalises leakage.
No hardcoded answers in SKILL.md. The proposer prompt and the judge both enforce this. The executor must derive results by running methods.
Snapshots + strict-better revert. Score on the first iter becomes the floor. Later iters that tie or regress revert to the best.

Safety

All processing is local except scout web fetches for public resources.
ClawBio disclaimer: research/education tool, not a medical device.

Gotchas

Do not skip scoping. The rubric is paper-specific; a generic rubric tunes nothing. Get the user to agree on methodology expectations.
Do not write a Python scorer. Earlier versions of this project did. They rewarded API-fetching, not methodology. The judge is the scorer.
Do not hand-pick the "best" snapshot yourself. Trust the loop. If the judge is calibrated wrong, fix the rubric, not the history.

clawpathy-autoresearch

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

clawpathy-autoresearch

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

clawpathy-autoresearch

Core idea

You are the orchestrator

Phase 1 — Scout

Phase 2 — Scope (you + user)

Phase 3 — Build

Phase 4 — Loop

Workspace layout

Key principles

Safety

Gotchas

Similar Skills

clawpathy-autoresearch

Core idea

You are the orchestrator

Phase 1 — Scout

Phase 2 — Scope (you + user)

Phase 3 — Build

Phase 4 — Loop

Workspace layout

Key principles

Safety

Gotchas

Similar Skills