From braintrust-skills
Improve Braintrust-backed AI agents through an evidence-backed loop over production traces, measurement, datasets, code changes, evals, and portable exit handoffs. Use when starting an agent-improvement session, running CI/scheduled/post-deploy flywheels, investigating production score degradation, finding eval coverage gaps, or deciding whether to add/update scorers, datasets, instrumentation, or agent behavior.
Slash command: /braintrust-skills:bt-flywheel
Five-phase cycle: Orient → Discover → Diagnose → Improve → Verify & Decide. On exit, write a portable handoff contract for the caller; do not execute external side effects.
Files:
- agents/openai.yaml
- references/bt-eval-patterns.md
- references/bt-flywheel-output-templates.md
- references/bt-functions-patterns.md
- references/bt-sql-patterns.md
- references/bt-sync-patterns.md
- references/bt-topics-patterns.md
- references/bt-view-patterns.md
- schemas/bt-flywheel-summary.schema.json
- scripts/bt-curate-patterns.py
Every phase can surface the need to change any of these:
| Artifact | Where it lives | How to change |
|---|---|---|
| Agent | Customer codebase (code files) | Code edits |
| Measurement | Braintrust scorers/facets/classifiers or codebase | bt functions push or code edit |
| Datasets | Braintrust | bt datasets create/update/view/list |
| Instrumentation | Customer codebase/logging config | Code edits |
Load these when executing the relevant phase:
- references/bt-sql-patterns.md: SQL query templates for Discover and Verify & Decide
- references/bt-view-patterns.md: bt view command patterns
- references/bt-eval-patterns.md: eval invocation patterns
- references/bt-functions-patterns.md: scorer, prompt, and dataset CLI patterns
- references/bt-sync-patterns.md: bulk log/experiment/dataset sync (pull/push)
- references/bt-topics-patterns.md: Topics automation (input clustering, classification)
- scripts/bt-curate-patterns.py: ground truth labeling, split assignment, dataset row construction/upsert
- references/bt-flywheel-output-templates.md: bt-flywheel-summary.json handoff and bt-flywheel-narrative.md templates
Before starting, check for autonomous mode signals in order:
1. mode: autonomous in the invocation
2. CI=true environment variable
3. FLYWHEEL_AUTONOMOUS=true environment variable
If any signal is present: autonomous mode; suppress all gates, log all decisions, and write bt-flywheel-summary.json and bt-flywheel-narrative.md on exit.
Otherwise: interactive mode — present plans before irreversible actions and wait for confirmation.
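The signal check above can be sketched as a small shell function. The name `flywheel_mode` and the convention of passing the invocation text as the first argument are assumptions for illustration; only the three signals and their order come from the text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "autonomous" or "interactive".
# $1 is the invocation text (an assumption about how the caller passes it).
flywheel_mode() {
  local invocation="${1:-}"
  # Signal 1: "mode: autonomous" appears in the invocation.
  case "$invocation" in
    *"mode: autonomous"*) echo "autonomous"; return 0 ;;
  esac
  # Signal 2: CI=true in the environment.
  if [ "${CI:-}" = "true" ]; then
    echo "autonomous"; return 0
  fi
  # Signal 3: FLYWHEEL_AUTONOMOUS=true in the environment.
  if [ "${FLYWHEEL_AUTONOMOUS:-}" = "true" ]; then
    echo "autonomous"; return 0
  fi
  echo "interactive"
}
```

Calling `flywheel_mode "$invocation_text"` once at startup and storing the result keeps every later gate decision consistent within a run.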
Establish session context before running any queries.
Step 1 — Resolve the active project (never ask the user for a project ID):
Check in this order:
1. .bt/config.json in the working directory: written by bt setup, contains project (name) and/or project_id
2. AGENTS.md, CLAUDE.md, .cursor/rules, .github/copilot-instructions.md, or similar files, which may document project name, ID, score columns, eval paths, dataset names
3. bt projects list --json: resolve name → ID programmatically
# Check local bt config
cat .bt/config.json 2>/dev/null
# If project name is known but ID is not:
bt projects list --json
# Find the entry where "name" matches → use its "id"
# If nothing is configured, list for user to choose by name:
bt projects list
If .bt/config.json is absent, suggest: "Run bt setup in this directory to configure the active project — it stores project and project_id in .bt/config.json so every bt command automatically targets the right project."
Step 2 — Establish goal:
In interactive mode, ask: What metric or behavior to improve? (default if skipped: "general health check")
In autonomous mode, read from FLYWHEEL_GOAL env var (fallback: "general health check").
Step 3 — Resolve baseline experiment (never ask the user for an experiment ID):
Check in this order:
1. FLYWHEEL_BASELINE_EXPERIMENT env var
2. bt eval output in this session: the experiment ID is printed on completion
3. bt experiments list --json -p <project-name>
# Returns objects with: id, name, created
# Sort by created descending; use the most recent as baseline
Schema Discovery (always run after project ID is resolved):
Check project instruction files first for:
- Documented score column names (for example scores."Response Quality")
If not documented there, run schema introspection using the resolved <PROJECT_ID> with progressive time window expansion:
# Try 1 day
bt sql "SELECT * FROM project_logs('<PROJECT_ID>') WHERE is_root = true AND created >= NOW() - INTERVAL 1 day LIMIT 1"
# If no results, try 7 days
bt sql "SELECT * FROM project_logs('<PROJECT_ID>') WHERE is_root = true AND created >= NOW() - INTERVAL 7 day LIMIT 1"
# If still no results, try 30 days
bt sql "SELECT * FROM project_logs('<PROJECT_ID>') WHERE is_root = true AND created >= NOW() - INTERVAL 30 day LIMIT 1"
Inspect the returned row to identify scores.* and facets.* column names. Null values for nested fields still reveal the column name structure.
If no rows found after 30 days: note this and proceed with generic queries.
Store all discovered column names and the resolved project ID for use in subsequent phases.
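The progressive window expansion can also be written as a loop rather than three manual attempts. This is a sketch under assumptions: the function name `probe_schema_row` is hypothetical, and it treats empty output or a bare `[]` from `bt sql` as "no rows", which may not match the CLI's actual output format.

```shell
#!/usr/bin/env bash
# Sketch of the 1 -> 7 -> 30 day schema probe.
# Assumption: "no results" shows up as empty output or a bare "[]".
probe_schema_row() {
  local project_id="$1" days row
  for days in 1 7 30; do
    row="$(bt sql "SELECT * FROM project_logs('${project_id}') WHERE is_root = true AND created >= NOW() - INTERVAL ${days} day LIMIT 1")" || return 1
    if [ -n "$row" ] && [ "$row" != "[]" ]; then
      echo "window_days=${days}" >&2   # record which window matched
      printf '%s\n' "$row"
      return 0
    fi
  done
  # Nothing in 30 days: the caller proceeds with generic queries.
  return 2
}
```

A return code of 2 distinguishes "no traffic in 30 days" from a failed `bt sql` call (return code 1), so the caller can note the gap and continue instead of aborting.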
Resolve app URL and org slug (needed to construct Braintrust links in the summary):
bt status --json
# Returns: { "org": "<org-name>", ... }
Braintrust URLs (URL-encode spaces as %20):
- https://www.braintrust.dev/app/<org>/p/<project-name>/experiments/<experiment-id>
- https://www.braintrust.dev/app/<org>/p/<project-name>/r/<trace-id>
Mine production traces for patterns the agent has not been evaluated on. The goal is a truthful map of production behavior, not merely running prewritten queries.
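For the summary links described under Braintrust URLs, a minimal sketch of the link construction follows. The helper names are hypothetical, and only spaces are encoded, as the note on URLs specifies; names containing other reserved characters would need fuller percent-encoding.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for building Braintrust links.
# Only spaces are encoded; other reserved characters are not handled.
encode_spaces() {
  printf '%s' "$1" | sed 's/ /%20/g'
}

# experiment_url <org> <project-name> <experiment-id>
experiment_url() {
  printf 'https://www.braintrust.dev/app/%s/p/%s/experiments/%s\n' \
    "$(encode_spaces "$1")" "$(encode_spaces "$2")" "$3"
}

# trace_url <org> <project-name> <trace-id>
trace_url() {
  printf 'https://www.braintrust.dev/app/%s/p/%s/r/%s\n' \
    "$(encode_spaces "$1")" "$(encode_spaces "$2")" "$3"
}
```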
Load references/bt-sql-patterns.md; use its examples for Braintrust SQL syntax, data shapes, search(), MATCH, ANY_SPAN(), score unpivoting, and aggregation patterns. Load references/bt-topics-patterns.md if Topics or facets are available. Use a 7-day window by default, expand for low traffic, and always keep project_logs() queries bounded by time/range or specific IDs.
Run discovery as an iterative hypothesis loop:
- Use search('<term>') or field-level MATCH terms drawn from error messages, tool names, refusals, user language, output-format failures, policy words, domain concepts, or newly observed failure labels.
- Load references/bt-view-patterns.md and inspect full span trees before attributing root cause.
Explore evidence planes including reliability, quality, coverage, behavior/tooling, performance/cost, change correlation, and eval alignment.
Before Diagnose, write findings with enough detail to support or reject root causes:
DISCOVER FINDINGS:
- Window and traffic: <time range>, <N traces/spans>, low-traffic caveats
- Strongest signals: <aggregate/statistic> with query/trace evidence
- Failure modes: <specific behavior>, example trace IDs/URLs, affected cohorts
- Healthy cohorts: <what works>, so fixes stay targeted
- Production vs eval gaps: <missing inputs/behaviors/labels>
- Measurement gaps: <failure mode not scored/tagged/faceted>, proposed scorer/facet
- Confidence and unknowns: <what still needs trace inspection or human labels>
Synthesize Discover findings and determine what needs to change. This is the routing intelligence of the flywheel — reason carefully before producing an action plan. The examples below are diagnostic prompts, not a closed taxonomy.
For each candidate root cause, name the supporting evidence, counterevidence, confidence, and artifact to change. Prefer "measure first" when production shows a real problem but current evals/scorers cannot observe it.
Is this a measurement gap or new scorer/facet needed?
Is this a scorer problem?
Is this a dataset gap?
Is this an agent problem?
Is this instrumentation or observability drift?
Is this a structural change needed?
Nothing actionable?
Write bt-flywheel-summary.json before exiting.
Produce a prioritized action plan listing which artifacts to change and in what order. Multiple conditions can apply; list them in priority order and execute sequentially.
In interactive mode: Present the action plan with reasoning. Wait for confirmation or override before proceeding. Honor any steps the user wants to skip.
In autonomous mode: Log the full action plan and proceed immediately.
Apply the artifact route chosen in Diagnose. Multiple routes can apply; execute them in priority order with measurement/instrumentation fixes before optimizing agent behavior.
In interactive mode: Present the planned artifact changes and wait for confirmation before irreversible writes. In autonomous mode: Apply the action plan and log each change with evidence, files/functions touched, and why it was safe.
Load references/bt-functions-patterns.md.
Create or update a scorer, classifier, or facet when Diagnose found a measurement gap or stale criteria:
blocked or needs_work and add a label_data next step.
If the function lives in Braintrust, read it first with bt functions view <slug> -p <project-name>, make a targeted local change, then push with bt functions push -p <project-name> --file <path>. If it lives in the codebase, edit the scorer file directly.
Load references/bt-functions-patterns.md for dataset CLI examples. Load references/bt-sql-patterns.md only when SQL filtering over dataset content is needed.
Pull balanced failing and passing candidates so the dataset does not skew toward hard cases only:
bt sql "SELECT id, input, output, scores.\"<SCORE_COL>\" FROM project_logs('<PROJECT_ID>')
WHERE scores.\"<SCORE_COL>\" <= 0.5 AND created >= NOW() - INTERVAL 7 day
ORDER BY RANDOM() LIMIT 50"
bt sql "SELECT id, input, output, scores.\"<SCORE_COL>\" FROM project_logs('<PROJECT_ID>')
WHERE scores.\"<SCORE_COL>\" >= 0.8 AND created >= NOW() - INTERVAL 7 day
ORDER BY RANDOM() LIMIT 50"
Inspect candidate traces with bt view trace --object-ref project_logs:<project-id> --trace-id <id> --json and extract the root input. Do not use bad production output as expected; use scripts/bt-curate-patterns.py for generated ground truth, deterministic train/validation splits, stable row IDs, and provenance metadata. In interactive mode, show a sample of generated labels before writing.
Use bt datasets create/update with --id-field id. If the agent interface changed, inspect and update stale rows so evals fail only for real behavior gaps, not obsolete data shape.
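To illustrate what a deterministic train/validation split means in practice, here is a minimal sketch that derives the split from the stable row ID. The 80/20 ratio, the use of `cksum` as the hash, and the function name `assign_split` are all assumptions; scripts/bt-curate-patterns.py may implement this differently.

```shell
#!/usr/bin/env bash
# Sketch: assign a split from a stable row ID so reruns agree.
# Assumptions: an 80/20 train/validation ratio and cksum as the hash.
assign_split() {
  local row_id="$1" checksum
  # cksum prints "<crc> <byte-count>"; keep the CRC only.
  checksum="$(printf '%s' "$row_id" | cksum | cut -d ' ' -f 1)"
  if [ $(( checksum % 10 )) -lt 8 ]; then
    echo "train"
  else
    echo "validation"
  fi
}
```

Because the split depends only on the row ID, re-curating the same trace never moves it between train and validation, which keeps validation scores comparable across runs.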
Edit the agent codebase only after trace/eval evidence points to behavior the agent owns. Keep the change targeted: system prompt, tool schema, output format/parser, routing/trajectory logic, or retrieval/tool-use behavior. Before editing, read the relevant files and local instructions. Do not run git add or git commit; the caller owns git operations.
Fix instrumentation when production behavior cannot be reconstructed well enough to diagnose, curate, or verify. Typical changes: restore root input/output logging, add compact metadata used for slicing, expose tool arguments/results safely, repair trace IDs, or normalize renamed score/facet fields. Keep large payloads out of inline logs unless they are required for eval/debugging.
Run evals, compare against baseline, inspect regressions, route the next loop, and write the exit handoff when stopping.
Load references/bt-eval-patterns.md. Find eval files from project instructions, find . -name "eval_*.py" -o -name "eval_*.ts" | grep -v node_modules | grep -v .venv, or an evals/ directory.
If the dataset has split metadata, use train for smoke/iteration and validation for the final measurement. Always run a smoke eval before a full eval:
set -a && source .env && set +a
bt eval --first 20 <eval_file>
If smoke scores are near zero or the eval is structurally broken, stop and route back to Improve. If smoke passes, run the full eval and capture the experiment ID/URL:
set -a && source .env && set +a
bt eval <eval_file>
# If bt eval fails, fall back to:
braintrust eval --env-file .env <eval_file>
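The smoke-then-full sequence with the CLI fallback can be wrapped in one function. This is a sketch: `run_flywheel_eval` is a hypothetical name, and it only treats a non-zero exit code from the smoke run as a structural failure. Judging whether smoke scores are near zero still requires reading the eval output.

```shell
#!/usr/bin/env bash
# Sketch of the smoke -> full eval sequence with the CLI fallback.
# A non-zero exit from the smoke run is treated as a structural failure;
# checking for near-zero scores is left to the caller.
run_flywheel_eval() {
  local eval_file="$1"
  # Export everything in .env, as in the commands above.
  if [ -f .env ]; then
    set -a && . ./.env && set +a
  fi
  # Smoke run on the first 20 rows.
  if ! bt eval --first 20 "$eval_file"; then
    echo "smoke eval failed; route back to Improve" >&2
    return 1
  fi
  # Full run, falling back to the braintrust CLI if bt eval fails.
  bt eval "$eval_file" || braintrust eval --env-file .env "$eval_file"
}
```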
Load references/bt-sql-patterns.md for experiment query templates and references/bt-view-patterns.md for trace drill-in. Compare new vs baseline score statistics, check scorer distributions, and inspect regressions with full traces:
bt sql "SELECT AVG(scores.\"<SCORE_COL>\") AS avg, MIN(scores.\"<SCORE_COL>\") AS min FROM experiment('<new-id>')"
bt sql "SELECT AVG(scores.\"<SCORE_COL>\") AS avg, MIN(scores.\"<SCORE_COL>\") AS min FROM experiment('<baseline-id>')"
bt sql "SELECT id, scores.\"<SCORE_COL>\" FROM experiment('<new-id>') WHERE scores.\"<SCORE_COL>\" < 0.5"
Compile a verdict with metric deltas, regression count, scorer health, new failure patterns, remaining dataset gaps, and confidence.
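Once the two averages have been read from the experiment queries, the metric delta for the verdict is a subtraction. The helpers below are a hypothetical sketch; the 0.01 noise threshold in `classify_delta` is an assumption and should be tuned to the scorer's variance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: difference between the new and baseline averages.
# awk does the floating-point arithmetic.
score_delta() {
  awk -v new="$1" -v base="$2" 'BEGIN { printf "%.4f\n", new - base }'
}

# Prints "improved", "regressed", or "flat" for a delta.
# Assumption: changes within 0.01 of zero count as noise.
classify_delta() {
  awk -v d="$1" -v eps="${2:-0.01}" 'BEGIN {
    if (d > eps) print "improved"
    else if (d < -eps) print "regressed"
    else print "flat"
  }'
}
```

For example, `classify_delta "$(score_delta 0.75 0.50)"` prints `improved`. A positive delta alone does not justify an improved outcome; the regression count and scorer health from the same verdict still apply.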
When multiple conditions apply, address them in this priority order:
| Condition | Next route |
|---|---|
| Production failure is real but unmeasured | Improve: measurement |
| Scorer distribution stuck at 0/1 or clearly wrong | Improve: measurement |
| Trace data cannot support diagnosis | Improve: instrumentation |
| New failure pattern emerged, not in datasets | Discover focused → Improve: dataset |
| Metric improved on validation but not on train | Improve: dataset |
| Metric did not move despite agent change | Improve: agent |
| Metric improved and new edge cases were found | Improve: dataset → Verify |
| Datasets do not cover eval-discovered cases | Improve: dataset → Verify |
| Metric improved, no regressions, measurement healthy | Exit handoff with outcome improved |
| Production healthy and no follow-up needed | Exit handoff with outcome healthy |
In interactive mode, present the routing decision and let the user override or stop. In autonomous mode, route automatically. If the target metric has not improved after 3 full loop iterations, stop with outcome no_convergence.
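The three-iteration cap can be sketched as a bounded loop. Here `run_iteration` is a hypothetical callback supplied by the caller that performs one full loop and returns 0 when the target metric improved; the sketch simplifies routing to that single signal and ignores the other exit conditions in the table.

```shell
#!/usr/bin/env bash
# Sketch of the three-iteration cap. run_iteration is a hypothetical
# callback: it performs one full loop and returns 0 when the target
# metric improved.
flywheel_loop() {
  local max_iterations=3 i=1
  while [ "$i" -le "$max_iterations" ]; do
    if run_iteration "$i"; then
      echo "improved"
      return 0
    fi
    i=$(( i + 1 ))
  done
  # Three full iterations without improvement.
  echo "no_convergence"
}
```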
Write bt-flywheel-summary.json and bt-flywheel-narrative.md when stopping in autonomous mode, and present the same content in interactive mode. The handoff is adapter-neutral: it should help a local developer, triggered workflow, GitHub Action, embedded app, or internal orchestrator decide what to do next. Do not open PRs, create issues, send Slack/Jira/Linear messages, call webhooks, block deploys, or trigger rollbacks from the skill itself.
Use references/bt-flywheel-output-templates.md for the schema and examples. Required handoff concepts:
- outcome: healthy, improved, needs_work, blocked, or no_convergence
- severity, blocking, and confidence
- summary, evidence-backed findings, artifact-specific changes, and verification
- links for Braintrust experiments, traces, queries, docs, or external references
- artifacts for local files produced by the run
- next_steps with adapter-neutral intent, priority, title/body, optional suggested destination, and idempotency key
Prefer intent over destination. Example intents: no_action, review_change, investigate, label_data, rerun, notify, block_release, rollback. Example suggested destinations: local_summary, code_review, issue_tracker, chat, release_gate, scheduler, app_ui, external_system, none.
The flywheel run is complete when any of the following is true:
- improved: target metric improved, no blocking regressions, measurement healthy
- healthy: production is already healthy
npx claudepluginhub dpguthrie/braintrust-skills --plugin bt-flywheel