wandb-driven-dev

Name: wandb-driven-dev
Author: tcapelle

Experiment-driven development for ML codebases. Hypothesis-first, baseline+variant, smoke-before-full discipline with W&B as the source of truth. Bundles wbagent W&B query helpers, project-local experiment launcher/entrypoint setup, experiment report helpers, and on-demand W&B query and EDD reviewer agents.

npx claudepluginhub tcapelle/wandb-driven-dev --plugin wandb-driven-dev

Popularity

Stars

Med: 0·Avg: 285

Installs

Med: 0·Avg: 1

What's Inside

Agents (2)

reviewer

/reviewer

Use this agent in Phase 5 of the wandb-driven-dev workflow to review a completed experiment and draft the ## Result block. The agent reads plan.md, the staged result (if the watcher produced one), pulls fresh wandb summaries, validates that the only config difference between baseline and variant is the named dimension, and proposes a verdict (pass/fail/inconclusive) with key numbers and merge recommendation. Returns the proposed ## Result markdown — DOES NOT auto-write it. The main thread shows it to the user for approval before splicing into plan.md. Examples: <example>context: User typed 'review experiment 20260429-loss-scheme'. assistant: 'Delegating to the reviewer agent to draft the verdict.'</example> <example>context: wandb-driven-dev skill Phase 5 invoked after a watcher run. assistant: 'Spawning reviewer agent with the slug and worktree path.'</example>

wandb-query

/wandb-query

Use this agent for off-thread analysis of W&B projects and training runs — anything that requires scanning more than a handful of runs, pulling histories, comparing configs across many runs, or diagnosing a crashed/diverged run. Frees the main thread from large query outputs. Examples: <example>user: 'find every run in wandbproject/foo where val_loss < 0.5 and tell me what hyperparameters they share' assistant: 'I'll delegate this to the wandb-query agent — it'll scan the project and return a structured summary.'</example> <example>user: 'why did run abc123 crash?' assistant: 'I'll spawn the wandb-query agent to pull the history and crash signal.'</example> <example>context: wandb-driven-dev skill in Phase 5 needs deep analysis before drafting a verdict. assistant: 'Delegating cross-run config + history scan to wandb-query.'</example> Do not use for single-run summary lookups (one summary_metrics.get call) — the round-trip overhead isn't worth it.

Skills (2)

wandb-driven-dev

/wandb-driven-dev

Enforce a systematic, reproducible approach to empirical questions — hypothesis first, baseline + variant, smoke before full, wandb as source of truth. Use whenever the user asks 'does X work?', 'is A better than B?', 'what's the best N?', or says 'experiment', 'ablation', 'benchmark', 'sweep'. Also use for first-run setup/reconfigure of a repo's experiment launcher command, training entrypoint, reproduction model, GPU budgets, W&B metrics, and for W&B Reports from experiment runs.

wbagent

/wbagent

Use this skill for querying and analyzing Weights & Biases projects through the W&B SDK and the local `wandb_helpers.py` query helpers. Covers run discovery, filtered run-table queries, selected summary metrics, config comparisons, exact run counts, artifacts, sweeps, reports, and bounded history scans. This is the preferred way to query W&B from the wandb-driven-dev plugin.

Stats

Version: 0.1.0

Language: Python

Stars: 0

Maintenance: Excellent

Last Commit: Jun 1, 2026

Added: Apr 30, 2026

Available In

wandb-driven-dev-local

Safety Signals

Caution

Uses power tools

Uses Bash, Write, or Edit tools

README

wandb-driven-dev

A Claude Code plugin that enforces experiment-driven development for ML codebases. Every empirical claim is backed by a W&B run, a baseline to compare against, and a falsifier written before the run started. Nothing merges on vibes.

What's inside

Skills

| Skill | Trigger | Purpose |
| --- | --- | --- |
| `wandb-driven-dev` | `/wandb-driven-dev` (also auto-triggers on "experiment", "ablation", "is A better than B", setup/reconfigure, and W&B Report requests) | The methodology: Phases 0–6 from setup to cleanup, with project-local experiment launcher config, training entrypoint, worktree bootstrap, smoke gates, launch, ETA-aware watcher, review, and experiment report helpers. |
| `wbagent` | Auto-triggered on W&B queries | Toolkit for querying W&B runs, summaries, configs, histories, artifacts, sweeps, and reports through `wandb_helpers.py`. |

Experiment report helpers live in wandb-driven-dev; the general W&B Reports authoring guide (recipe, skeleton, filters, gotchas) lives in the wbagent reference at skills/wbagent/references/REPORTS.md.

Fast count, top-k, and at-step comparison workflows use reusable query primitives in skills/wbagent/scripts/wandb_helpers.py, so common W&B questions do not require one-off Python or broad run iteration.

Project setup discovery is hardcoded in skills/wandb-driven-dev/scripts/setup_project.py. It reads the local config, uses configured decision/health metrics, samples a few recent finished runs when run IDs are not supplied, and writes curves plus wandb_metadata.preflight back to .claude/wandb-driven-dev.local.md. It stores keys and decisions only, not run summaries or metric values. The default path is a single selected-summary GraphQL query, so it does not materialize SDK run objects or scan history. Use --validate-history only when summary coverage is ambiguous; that slower path uses explicit scan_history(keys=...), bounded scans, and targeted sparse metric fallback.
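The slower history-validation path described above can be pictured with a small sketch. It is not the code in setup_project.py. It assumes only the public `Run.scan_history(keys=...)` call from the W&B SDK and caps the number of rows read per key. The names `history_coverage` and `FakeRun` are hypothetical. `FakeRun` is an offline stand-in so the example runs without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List, Optional


def history_coverage(run: Any, keys: List[str], max_rows: int = 200) -> Dict[str, bool]:
    """Report, per key, whether a bounded history scan finds a logged value.

    Keys are scanned one at a time so a sparse metric is not hidden by rows
    that lack the other keys (scan_history only yields rows containing every
    requested key).
    """
    coverage: Dict[str, bool] = {}
    for key in keys:
        rows = run.scan_history(keys=[key])
        # islice stops the iteration early, which bounds the scan.
        coverage[key] = any(
            row.get(key) is not None for row in islice(rows, max_rows)
        )
    return coverage


class FakeRun:
    """Hypothetical offline stand-in for a wandb Run, for illustration only."""

    def __init__(self, rows: List[dict]) -> None:
        self._rows = rows

    def scan_history(self, keys: Optional[List[str]] = None) -> Iterator[dict]:
        for row in self._rows:
            if keys is None or all(k in row for k in keys):
                yield row


if __name__ == "__main__":
    run = FakeRun([
        {"_step": 0, "train/loss": 1.0},
        {"_step": 1, "train/loss": 0.9, "val/loss": 1.1},
    ])
    print(history_coverage(run, ["train/loss", "val/loss", "val/acc"]))
```

With a real run, the same function could be called on an object returned by `wandb.Api().run(...)`. The metric names used here are placeholders.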

Plot-reading workflows are hardcoded in skills/wandb-driven-dev/scripts/curve_analysis.py. It assumes setup already persisted curve step keys, then turns selected W&B curves into pandas-derived features such as value at step, local slope, trend, and best run by value/slope. Slope/trend use trailing rolling smoothing by default and report noise/confidence fields so noisy endpoints do not dominate the verdict. The analyzer has separate early-training health checks for launch stability and progress-stage checks for slope shifts and sudden spikes.
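As a rough illustration of the features named above (value at step, trailing smoothing, local slope), here is a dependency-free sketch. The real analyzer is pandas-based. This version uses plain Python lists so it can run anywhere, and every function name in it is hypothetical. It does not reproduce the analyzer's noise/confidence fields or its health checks.

```python
from statistics import fmean
from typing import List, Optional, Sequence


def trailing_mean(values: Sequence[float], window: int = 5) -> List[float]:
    """Trailing rolling mean: each point averages itself and earlier points only."""
    return [
        fmean(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


def value_at_step(steps: Sequence[int], values: Sequence[float],
                  step: int) -> Optional[float]:
    """Value at the last logged step <= step (steps must be sorted ascending)."""
    found = None
    for s, v in zip(steps, values):
        if s > step:
            break
        found = v
    return found


def local_slope(steps: Sequence[int], values: Sequence[float],
                window: int = 5) -> float:
    """Least-squares slope over the last `window` points; 0.0 if undefined."""
    xs, ys = list(steps[-window:]), list(values[-window:])
    if len(xs) < 2:
        return 0.0
    mean_x, mean_y = fmean(xs), fmean(ys)
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom


if __name__ == "__main__":
    steps = [0, 10, 20, 30, 40]
    loss = [1.0, 0.8, 0.6, 0.4, 0.2]
    smooth = trailing_mean(loss, window=2)
    print(value_at_step(steps, loss, 25), local_slope(steps, smooth, window=3))
```

Computing the slope on the smoothed series rather than the raw one is the point of the trailing window: a single noisy endpoint moves the estimate less.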

Agents

| Agent | When | Purpose |
| --- | --- | --- |
| `wandb-query` | On-demand | Off-thread analysis of a W&B project or run. Frees the main thread from large query outputs. |
| `reviewer` | Spawned by the `wandb-driven-dev` skill in Phase 5 | Reads the staged result, validates numbers against fresh W&B summaries, drafts the verdict and merge recommendation. |
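One check the reviewer performs is that baseline and variant differ only in the named dimension. A minimal sketch of that idea follows, assuming both configs are available as nested dictionaries (for example from a run's `config`). The helper names and the dotted-path convention are assumptions for illustration, not the agent's actual implementation.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # marker for a key present on one side only


def flatten(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. {"loss": {"scheme": 1}} -> {"loss.scheme": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def config_diff(baseline: Dict[str, Any],
                variant: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {path: (baseline_value, variant_value)} for every differing path."""
    a, b = flatten(baseline), flatten(variant)
    return {
        path: (a.get(path, _MISSING), b.get(path, _MISSING))
        for path in sorted(set(a) | set(b))
        if a.get(path, _MISSING) != b.get(path, _MISSING)
    }


def only_differs_in(baseline: Dict[str, Any], variant: Dict[str, Any],
                    dimension: str) -> bool:
    """True when the named dotted path is the one and only difference."""
    return set(config_diff(baseline, variant)) == {dimension}


if __name__ == "__main__":
    base = {"lr": 1e-3, "seed": 0, "loss": {"scheme": "mse"}}
    var = {"lr": 1e-3, "seed": 0, "loss": {"scheme": "huber"}}
    print(config_diff(base, var))
    print(only_differs_in(base, var, "loss.scheme"))
```

A check like this fails closed: an accidental second difference, such as a changed seed, makes `only_differs_in` return False and the comparison is flagged rather than silently accepted.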

Templates

templates/wandb-driven-dev.local.md.template — copy to your project as .claude/wandb-driven-dev.local.md for per-project config (W&B entity/project, repo launcher command, training entrypoint, default metrics, GPU budgets, free-form notes).

Install (local plugin)

Drop this directory into a Claude Code plugins location, or run Claude Code pointing at it directly:

cc --plugin-dir /path/to/wandb-driven-dev

wbagent is vendored directly into skills/wbagent/ as plain files, copied from the upstream W&B core repository at services/wb_agent/src/agent_repository/context_content/production/wbagent/skills/wbagent. The original upstream base commit is recorded in skills/wbagent/.upstream-commit, but this plugin intentionally carries local query-helper extensions in skills/wbagent/scripts/wandb_helpers.py. Do not overwrite skills/wbagent/ with a blind upstream sync; port local improvements to upstream manually and then reconcile the vendored copy deliberately.

Quick start

/wandb-driven-dev setup

Claude interviews you for the W&B project, repo-specific experiment launcher command, training entrypoint, reproduction model, GPU budgets, and decision/health metrics, then writes .claude/wandb-driven-dev.local.md. Subsequent invocations read it.

For a new experiment:

/wandb-driven-dev

Claude walks you through hypothesis → design → smoke → launch → review → cleanup, gating each phase. W&B runs use the exp/<slug> tag and the exp-<slug>-<role> name convention, so all runs of one experiment can be selected with a single tag filter.
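The tag and name convention can be produced and queried with a few small helpers. This is a sketch: the function names are hypothetical, and the filter dictionary assumes the MongoDB-style `filters` argument accepted by `wandb.Api().runs(...)`. The slug is the example one from the reviewer agent description.

```python
from typing import Dict, List


def exp_tag(slug: str) -> str:
    """Tag shared by every run of one experiment: exp/<slug>."""
    return f"exp/{slug}"


def exp_run_name(slug: str, role: str) -> str:
    """Run name for one arm of the experiment: exp-<slug>-<role>."""
    return f"exp-{slug}-{role}"


def exp_filter(slug: str) -> Dict[str, Dict[str, List[str]]]:
    """Filter selecting all runs tagged with the experiment (assumed filter shape)."""
    return {"tags": {"$in": [exp_tag(slug)]}}


if __name__ == "__main__":
    slug = "20260429-loss-scheme"
    for role in ("baseline", "variant"):
        print(exp_run_name(slug, role), exp_tag(slug))
    # With network access, something like the following would list the runs;
    # "my-entity/my-project" is a placeholder path:
    #   import wandb
    #   runs = wandb.Api().runs("my-entity/my-project", filters=exp_filter(slug))
    print(exp_filter(slug))
```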

Prerequisites

- wandb and pandas Python packages on the Python you're running
- A W&B account with API key configured (wandb login or WANDB_API_KEY)
- For remote training: access to the runner, scheduler, or cluster used by the launcher command recorded in project config
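The first two prerequisites can be checked before starting a session. The sketch below is not part of the plugin. It assumes that `wandb login` stores credentials in `~/.netrc` under `api.wandb.ai`, which holds for the hosted service but may differ for self-hosted servers or on Windows. The function name and its injectable parameters are hypothetical.

```python
import importlib.util
import os
from pathlib import Path
from typing import Callable, List, Mapping, Optional


def missing_prerequisites(
    env: Optional[Mapping[str, str]] = None,
    netrc_path: Optional[str] = None,
    find_spec: Callable[[str], object] = importlib.util.find_spec,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems: List[str] = []

    # Both packages must be importable from the interpreter in use.
    for package in ("wandb", "pandas"):
        if find_spec(package) is None:
            problems.append(f"Python package not found: {package}")

    # Accept either the environment variable or a stored login (assumed location).
    netrc = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    has_env_key = bool(env.get("WANDB_API_KEY"))
    has_login = netrc.is_file() and "api.wandb.ai" in netrc.read_text(errors="ignore")
    if not (has_env_key or has_login):
        problems.append("No W&B credentials: run `wandb login` or set WANDB_API_KEY")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The check is deliberately read-only: it never prints or validates the key itself, only whether one appears to be configured.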

Project config schema

.claude/wandb-driven-dev.local.md (per-project, gitignored):

View full README on GitHub
