By tcapelle
Experiment-driven development for ML codebases. Hypothesis-first, baseline+variant, smoke-before-full discipline with W&B as the source of truth. Bundles wbagent W&B query helpers, project-local experiment launcher/entrypoint setup, experiment report helpers, and on-demand W&B query and EDD reviewer agents.
Use this agent in Phase 5 of the wandb-driven-dev workflow to review a completed experiment and draft the ## Result block. The agent reads plan.md, the staged result (if the watcher produced one), pulls fresh wandb summaries, validates that the only config difference between baseline and variant is the named dimension, and proposes a verdict (pass/fail/inconclusive) with key numbers and merge recommendation. Returns the proposed ## Result markdown — DOES NOT auto-write it. The main thread shows it to the user for approval before splicing into plan.md. Examples: <example>context: User typed 'review experiment 20260429-loss-scheme'. assistant: 'Delegating to the reviewer agent to draft the verdict.'</example> <example>context: wandb-driven-dev skill Phase 5 invoked after a watcher run. assistant: 'Spawning reviewer agent with the slug and worktree path.'</example>
Use this agent for off-thread analysis of W&B projects and training runs — anything that requires scanning more than a handful of runs, pulling histories, comparing configs across many runs, or diagnosing a crashed/diverged run. Frees the main thread from large query outputs. Examples: <example>user: 'find every run in wandbproject/foo where val_loss < 0.5 and tell me what hyperparameters they share' assistant: 'I'll delegate this to the wandb-query agent — it'll scan the project and return a structured summary.'</example> <example>user: 'why did run abc123 crash?' assistant: 'I'll spawn the wandb-query agent to pull the history and crash signal.'</example> <example>context: wandb-driven-dev skill in Phase 5 needs deep analysis before drafting a verdict. assistant: 'Delegating cross-run config + history scan to wandb-query.'</example> Do not use for single-run summary lookups (one summary_metrics.get call) — the round-trip overhead isn't worth it.
Enforce a systematic, reproducible approach to empirical questions — hypothesis first, baseline + variant, smoke before full, wandb as source of truth. Use whenever the user asks 'does X work?', 'is A better than B?', 'what's the best N?', or says 'experiment', 'ablation', 'benchmark', 'sweep'. Also use for first-run setup/reconfigure of a repo's experiment launcher command, training entrypoint, reproduction model, GPU budgets, W&B metrics, and for W&B Reports from experiment runs.
Use this skill for querying and analyzing Weights & Biases projects through the W&B SDK and the local `wandb_helpers.py` query helpers. Covers run discovery, filtered run-table queries, selected summary metrics, config comparisons, exact run counts, artifacts, sweeps, reports, and bounded history scans. This is the preferred way to query W&B from the wandb-driven-dev plugin.
Uses power tools
Uses Bash, Write, or Edit tools
A Claude Code plugin that enforces experiment-driven development for ML codebases. Every empirical claim is backed by a W&B run, a baseline to compare against, and a falsifier written before the run started. Nothing merges on vibes.
| Skill | Trigger | Purpose |
|---|---|---|
| wandb-driven-dev | /wandb-driven-dev (also auto-triggers on "experiment", "ablation", "is A better than B", setup/reconfigure, and W&B Report requests) | The methodology: Phases 0–6 from setup to cleanup, with project-local experiment launcher config, training entrypoint, worktree bootstrap, smoke gates, launch, ETA-aware watcher, review, and experiment report helpers. |
| wbagent | Auto-triggered on W&B queries | Toolkit for querying W&B runs, summaries, configs, histories, artifacts, sweeps, and reports through wandb_helpers.py. |
Experiment report helpers live in wandb-driven-dev; the general W&B Reports
authoring guide (recipe, skeleton, filters, gotchas) lives in the wbagent
reference at skills/wbagent/references/REPORTS.md.
Fast count, top-k, and at-step comparison workflows use reusable query
primitives in skills/wbagent/scripts/wandb_helpers.py, so common W&B questions
do not require one-off Python or broad run iteration.
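As a rough illustration of what count and top-k primitives look like, here is a minimal sketch over plain run records. The record shape and the names `count_runs` and `top_k` are assumptions made for this example; they are not the actual signatures in wandb_helpers.py.

```python
import heapq
from typing import Any, Callable, Iterable

# Hypothetical record shape: {"id": str, "summary": {metric_name: value}}
Run = dict[str, Any]


def count_runs(runs: Iterable[Run], predicate: Callable[[Run], bool]) -> int:
    """Count matching runs without building an intermediate list."""
    return sum(1 for run in runs if predicate(run))


def top_k(runs: Iterable[Run], metric: str, k: int, maximize: bool = False) -> list[Run]:
    """Return the k best runs by a summary metric, skipping runs that lack it."""
    scored = [r for r in runs if isinstance(r["summary"].get(metric), (int, float))]
    pick = heapq.nlargest if maximize else heapq.nsmallest
    return pick(k, scored, key=lambda r: r["summary"][metric])


runs = [
    {"id": "a", "summary": {"val_loss": 0.42}},
    {"id": "b", "summary": {"val_loss": 0.61}},
    {"id": "c", "summary": {}},  # e.g. a run that never logged val_loss
    {"id": "d", "summary": {"val_loss": 0.38}},
]

n_good = count_runs(runs, lambda r: r["summary"].get("val_loss", float("inf")) < 0.5)
best = [r["id"] for r in top_k(runs, "val_loss", k=2)]
print(n_good, best)  # 2 ['d', 'a']
```

Keeping the predicate and the metric name as parameters is what makes such helpers reusable across questions, instead of rewriting a loop per query.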
Project setup discovery is hardcoded in
skills/wandb-driven-dev/scripts/setup_project.py. It reads the local config,
uses configured decision/health metrics, samples a few recent finished runs
when run IDs are not supplied, and writes curves plus
wandb_metadata.preflight back to .claude/wandb-driven-dev.local.md. It
stores keys and decisions only, not run summaries or metric values. The default
path is a single selected-summary GraphQL query, so it does not materialize SDK
run objects or scan history. Use --validate-history only when summary
coverage is ambiguous; that slower path uses explicit scan_history(keys=...),
bounded scans, and targeted sparse metric fallback.
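The slower history-validation path can be pictured with the sketch below, which is an assumption about the general shape and not the code in setup_project.py. It relies only on the W&B SDK's `Run.scan_history(keys=...)`, which yields only rows where every requested key is present. The function name `keys_with_history`, the `max_rows` bound, and the `FakeRun` stand-in are hypothetical.

```python
from itertools import islice
from typing import Any


def keys_with_history(run: Any, keys: list[str], max_rows: int = 200) -> dict[str, bool]:
    """Report, per key, whether the run logged at least one history row for it.

    Keys are scanned one at a time so that a sparse metric cannot hide a
    dense one: requesting them together would only return rows that
    contain all of them. Only coverage flags are kept, never values.
    """
    coverage = {}
    for key in keys:
        rows = islice(run.scan_history(keys=[key]), max_rows)  # bounded scan
        coverage[key] = any(row.get(key) is not None for row in rows)
    return coverage


class FakeRun:
    """Offline stand-in for a W&B run (hypothetical, for illustration only)."""

    def __init__(self, rows):
        self._rows = rows

    def scan_history(self, keys=None, page_size=1000, min_step=None, max_step=None):
        for row in self._rows:
            if keys is None or all(k in row for k in keys):
                yield row


rows = [
    {"_step": 0, "train/loss": 1.0},
    {"_step": 1, "train/loss": 0.8, "val/loss": 0.9},
]
coverage = keys_with_history(FakeRun(rows), ["train/loss", "val/loss", "val/acc"])
print(coverage)
```

With a real run object from `wandb.Api().run(...)`, the same function would issue one bounded scan per key, which is why this path is slower than the summary query.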
Plot-reading workflows are hardcoded in
skills/wandb-driven-dev/scripts/curve_analysis.py. It assumes setup already
persisted curve step keys, then turns selected W&B curves into pandas-derived
features such as value at step, local slope, trend, and best run by value/slope.
Slope/trend use trailing rolling smoothing by default and report
noise/confidence fields so noisy endpoints do not dominate the verdict. The
analyzer has separate early-training health checks for launch stability and
progress-stage checks for slope shifts and sudden spikes.
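To make the curve features concrete, here is a small standard-library sketch of value at step, trailing smoothing, and local slope. It is not the code in curve_analysis.py, which uses pandas; there the smoothing would typically be `Series.rolling(window, min_periods=1).mean()`. The function names and the sample curve are assumptions for illustration.

```python
from statistics import fmean


def trailing_mean(values: list[float], window: int) -> list[float]:
    """Smooth with a trailing window: each point averages itself and earlier points only."""
    return [fmean(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def value_at_step(steps: list[int], values: list[float], step: int) -> float:
    """Return the latest value logged at or before `step` (steps sorted ascending)."""
    eligible = [v for s, v in zip(steps, values) if s <= step]
    if not eligible:
        raise ValueError(f"no point logged at or before step {step}")
    return eligible[-1]


def local_slope(steps: list[int], values: list[float], window: int) -> float:
    """Least-squares slope over the last `window` points."""
    xs, ys = steps[-window:], values[-window:]
    x_bar, y_bar = fmean(xs), fmean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("need at least two distinct steps to fit a slope")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


steps = [0, 100, 200, 300, 400]
loss = [1.0, 0.8, 0.7, 0.9, 0.5]  # noisy tail: a spike followed by a drop

smoothed = trailing_mean(loss, window=3)
raw_slope = local_slope(steps, loss, window=3)
smooth_slope = local_slope(steps, smoothed, window=3)
print(value_at_step(steps, loss, 250), raw_slope, smooth_slope)
```

On this sample the raw slope is driven by the last two noisy points, while the smoothed slope is flatter, which is the effect the trailing smoothing is meant to achieve.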
| Agent | When | Purpose |
|---|---|---|
| wandb-query | On-demand | Off-thread analysis of a W&B project or run. Frees the main thread from large query outputs. |
| reviewer | Spawned by the wandb-driven-dev skill in Phase 5 | Reads the staged result, validates numbers against fresh W&B summaries, drafts the verdict and merge recommendation. |
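The reviewer's description says it validates that the only config difference between baseline and variant is the named dimension. A minimal sketch of such a check follows, assuming configs are nested dicts and the dimension is given as a dotted key path; `flatten`, `config_diff`, and `check_single_dimension` are hypothetical names, not the agent's implementation.

```python
from typing import Any

MISSING = "<missing>"  # placeholder shown when a key exists on one side only


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested config dicts into dotted key paths."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_diff(baseline: dict[str, Any], variant: dict[str, Any]) -> dict[str, tuple]:
    """Map each differing key path to its (baseline, variant) values."""
    a, b = flatten(baseline), flatten(variant)
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


def check_single_dimension(baseline, variant, dimension: str) -> tuple[bool, list[str]]:
    """Pass only if `dimension` differs and nothing else does."""
    diff = config_diff(baseline, variant)
    unexpected = sorted(key for key in diff if key != dimension)
    return dimension in diff and not unexpected, unexpected


baseline = {"optimizer": {"name": "adamw", "lr": 3e-4}, "loss": "ce", "seed": 0}
variant = {"optimizer": {"name": "adamw", "lr": 3e-4}, "loss": "focal", "seed": 1}

ok, unexpected = check_single_dimension(baseline, variant, "loss")
print(ok, unexpected)  # False ['seed']
```

Here the comparison is rejected because the seed changed alongside the loss, so any difference in the metric could not be attributed to the loss alone.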
templates/wandb-driven-dev.local.md.template — copy to your project as
.claude/wandb-driven-dev.local.md for per-project config (W&B entity/project,
repo launcher command, training entrypoint, default metrics, GPU budgets,
free-form notes).
Drop this directory into a Claude Code plugins location, or run Claude Code pointing at it directly:
cc --plugin-dir /path/to/wandb-driven-dev
wbagent is vendored directly into skills/wbagent/ as plain files, copied
from the upstream W&B core repository at
services/wb_agent/src/agent_repository/context_content/production/wbagent/skills/wbagent.
The original upstream base commit is recorded in
skills/wbagent/.upstream-commit, but this plugin intentionally carries local
query-helper extensions in skills/wbagent/scripts/wandb_helpers.py. Do not
overwrite skills/wbagent/ with a blind upstream sync; port local improvements
to upstream manually and then reconcile the vendored copy deliberately.
/wandb-driven-dev setup
Claude interviews you for the W&B project, repo-specific experiment launcher
command, training entrypoint, reproduction model, GPU budgets, and
decision/health metrics, then writes
.claude/wandb-driven-dev.local.md. Subsequent invocations read it.
For a new experiment:
/wandb-driven-dev
Claude walks you through hypothesis → design → smoke → launch → review →
cleanup, gating each phase. Wandb runs use the exp/<slug> tag and
exp-<slug>-<role> name convention so they're trivially filterable.
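The naming convention can be captured in a small helper like the sketch below. The function names, the slug validation rule, and the role values are assumptions for illustration. The filter dict assumes the MongoDB-style `filters` argument accepted by `wandb.Api().runs(...)`; verify it against your SDK version.

```python
import re

# Assumed slug shape: lowercase alphanumerics separated by single hyphens
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def experiment_identity(slug: str, role: str) -> dict:
    """Build the run name and tag for one role of an experiment."""
    if not SLUG_PATTERN.fullmatch(slug):
        raise ValueError(f"invalid experiment slug: {slug!r}")
    return {"name": f"exp-{slug}-{role}", "tags": [f"exp/{slug}"]}


def experiment_filter(slug: str) -> dict:
    """Filter selecting every run tagged with this experiment (assumed filter syntax)."""
    return {"tags": {"$in": [f"exp/{slug}"]}}


baseline = experiment_identity("20260429-loss-scheme", "baseline")
variant = experiment_identity("20260429-loss-scheme", "variant")
print(baseline["name"], variant["name"], baseline["tags"])
```

The name and tags would then be passed to `wandb.init(name=..., tags=...)`, so that both roles of an experiment share one tag and differ only in the name suffix.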
- wandb and pandas Python packages on the Python you're running
- W&B authentication (wandb login or WANDB_API_KEY)
- .claude/wandb-driven-dev.local.md (per-project, gitignored)
npx claudepluginhub tcapelle/wandb-driven-dev --plugin wandb-driven-dev