From ade-bench
Interactively plan and generate ADE-Bench benchmark tasks from a dbt project. Explores the codebase with the user in a pair-planning loop — understanding the domain, reasoning about what makes good benchmarks, and building up a task plan incrementally. Use when you want high-quality, tailored benchmark tasks for your own dbt project.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ade-bench:plan-tasks <path-to-dbt-project> [custom instructions]<path-to-dbt-project> [custom instructions]The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are pair-planning with the user to create benchmark tasks for ADE-Bench, a framework that evaluates AI agents on dbt and SQL tasks. Your job is to deeply understand the user's dbt project, reason about what bugs would make meaningful benchmarks, and build a task plan collaboratively.
You are pair-planning with the user to create benchmark tasks for ADE-Bench, a framework that evaluates AI agents on dbt and SQL tasks. Your job is to deeply understand the user's dbt project, reason about what bugs would make meaningful benchmarks, and build a task plan collaboratively.
You will plan tasks following this pattern, then generate the full scaffolding.
The user invokes this skill with a path and optional free-form instructions.
/ade-bench:plan-tasks /path/to/project [custom instructions]
Parse:
dbt_project.yml).--no-copy (optional flag): If present, the generated tasks will reference the original project in-place instead of copying it into the output directory. Use this for very large projects to avoid duplication. Default behavior is to copy.Store the custom instructions and flags; they inform every phase.
Before any project-specific work, check whether ade-bench (the harness that actually runs the generated tasks) is available. The plugin can generate tasks without it, but the user can't run them without it.
Detection — in order:
command -v ade — fastest check. If it returns a path and ade --help succeeds, ade-bench is installed.test -f ~/.ade-bench/pyproject.toml && grep -q 'name = "ade-bench"' ~/.ade-bench/pyproject.toml — fallback for cases where the user installed ade-bench but ~/.local/bin isn't on PATH yet.If detected: continue to Step 2.
If missing: tell the user
"ade-bench (the harness that runs the tasks I'm about to generate) isn't installed. I can install it now at
~/.ade-bench(default), or you can pick a different location, or skip this and install it later — generation works either way, but you'll need ade-bench before you can run the tasks."
If the user wants to install: follow the steps in skills/setup/SKILL.md (the same skill the user can also invoke directly as /ade-bench:setup). Pass through the Snowflake context if you've already auto-detected a Snowflake project, so the setup skill knows whether to copy .env.example.
If the user skips: continue, but at the end of Step 8 (or Post-Generation Summary) include a reminder that ade-bench still needs to be installed before tasks can run.
Either way, proceed to Step 2 once the user has answered.
Read all .md files in the knowledge/ directory relative to the plugin root (the directory containing .claude-plugin/). If you don't know the plugin root path, use Glob to find knowledge/base.md and derive it from there. Skip files named _template.md.
These files describe categories of bugs commonly found in dbt/SQL projects — what they are, when they're impactful, how difficulty varies by context, and how common they are in the real world. Use them as inspiration and reasoning material throughout the planning process. You are not constrained to these patterns — if you discover a bug opportunity specific to the user's project that doesn't match any archetype, pursue it.
knowledge/custom-tests.md is a special case — it describes when a bug needs a custom assertion test alongside the AUTO equality test, and gives templates for the common assertion archetypes (refs, sources, materialization, column existence, model existence). Read it once and refer back to it when deciding what tests to ship with each task (see Step 5 / "Decide on custom assertion tests").
Before engaging the user, do a quick scan to have informed questions. Read:
dbt_project.yml — project name, profile, materialization defaultsprofiles.yml — database type (DuckDB or Snowflake), connection detailsls the models/ tree to understand layers (staging, intermediate, marts, etc.)packages.yml — dbt packages in useDo NOT read every model file yet. The goal is to understand the project shape, not to scan exhaustively. You'll go deep in specific areas during the planning loop.
Generated tasks should add coverage, not duplicate bugs the user (or ade-bench's official task collection) has already exercised. Before planning, read what already exists so you can aim for novelty.
Where to look (skip a location if it doesn't exist):
tasks/ dir — at <ade-bench-path>/tasks/. Use the path discovered in Step 1.5. If ade-bench isn't installed, skip this location.{output_dir}/tasks/ already exists from a previous run, read it too.What to read. Glob for task.yaml files in each location. For each one, parse the variants block — if any variant.project_name matches the current dbt project name, this task targets the same project. Read it.
For each matching task, capture compactly:
task_iddescription (one-line)prompts[0].prompt (the agent-facing prompt)tagssetup/changes.patch, note the file(s) changed and the type of mutation (e.g., "constructor_standings.sql L23: MAX → SUM"). Don't quote the entire diff; one line per hunk is enough.If a task uses an external project copy (its project_dir doesn't match the directories you found), still include it — the project_name match is the canonical signal.
Compile a brief like:
Existing tasks for project {project_name} ({N} found):
- f1001 (medium): "Add a source model layer." | adds src_*.sql files, updates stg_* refs
- f1005 (medium): "Constructor points look too high." | int_constructor_standings.sql L18: MAX(points) → SUM(points)
- ...
Hold this brief in working memory for the entire planning loop.
If zero matching tasks were found, note "no existing tasks for {project_name}" and proceed normally — every task you generate is novel by definition.
Ask the user a short set of questions to scope the work. Ask them together in one turn. If the custom instructions already answer some of these, skip those and pre-fill from the instructions.
Also confirm the output directory. Default: ./ade-bench-tasks relative to the dbt project directory.
Wait for the user's answers before proceeding.
Now enter the iterative planning loop. You are pair-planning with the user — explore the code, build up the plan, and ask questions when you hit ambiguities.
Create {output_dir}/task-plan.md with an initial skeleton:
# Task Plan: {project_name}
## Configuration
- Project: {project_name}
- Database: {db_type}
- Tasks requested: {count}
- Difficulty mix: {mix}
- Focus area: {area}
- Custom instructions: {instructions or "none"}
## Exploration Notes
(findings will be captured here as planning progresses)
## Proposed Tasks
(tasks will be added here as candidates are identified)
Repeat this cycle until the plan is complete:
Go deep on the focus area the user specified. Read the actual SQL of candidate models — understand what they do, not just what tokens they contain.
For large projects (>30 models in the focus area), spawn Explore sub-agents in parallel using the Agent tool:
models/intermediate/finance/ and report what each one does, what SQL patterns it uses, and what its upstream/downstream dependencies are")For small projects or narrow focus areas, read the models directly — no agents needed.
When reading a model, think about:
After each exploration round, immediately update {output_dir}/task-plan.md. Capture:
A task entry evolves through stages:
Early (exploring):
### Task 1: (exploring int_account_arr_seats)
- Investigating the ARR pro-ration logic, looks promising for a date function bug
Developing:
### Task 1: mattermost001 — Hard (draft)
- **Model**: int_account_arr_seats
- **Bug idea**: Change DATEADD('day', ...) to DATEADD('month', ...) in the pro-ration
- **Why it's good**: Tests Snowflake date fluency + domain understanding of billing periods
- **Open question**: Need to confirm this model has downstream consumers that would surface the symptom
Ready:
### Task 1: mattermost001 — Hard
- **Model**: int_account_arr_seats (models/intermediate/finance/int_account_arr_seats.sql)
- **Bug**: Change DATEADD('day', ...) to DATEADD('month', ...) on line 47
- **Prompt**: "Finance flagged that ARR calculations for mid-month subscription changes are wildly off. The numbers are way too high for accounts that started mid-billing-cycle."
- **Difficulty**: Hard — requires understanding the pro-ration business logic to know which date unit is correct
- **Rationale**: Tests Snowflake DATEADD fluency + domain understanding of ARR pro-ration. The agent must read the business logic, not just grep for syntax errors.
- **Models to rebuild**: int_account_arr_seats, mart_arr
- **Solution seed table**: int_account_arr_seats
- **Custom assertion tests**: none (value bug — equality catches it)
- **Novelty**: distinct from f1005/f1006 (those swap MAX↔SUM on cumulative points; this swaps date units in pro-ration logic — different mutation type, different debugging skill)
- **Status**: ready
For each task, decide whether the bug needs a custom assertion test in addition to the AUTO equality test the harness generates from solution_seeds. Decide based on the bug type, not by asking the user. The decision rule, from knowledge/custom-tests.md:
When a custom test is needed, write it directly using the templates in knowledge/custom-tests.md. Use real names from the user's project — substitute MODEL_NAME, PROJECT_NAME, REF_1, etc. in the templates with the actual values. Record the test SQL (or a summary of it) in the task plan entry under Custom assertion tests so the user sees it during approval.
When you hit something you can't resolve from code alone, ask. Good questions:
int_opportunity_ext model has 23 JOINs. Is this realistic for your team or is it legacy code nobody touches?" (determines if it's a good benchmark target)stg_subscriptions (Snowflake-specific) or the LEFT JOIN in int_known_licenses (ANSI). Which tests a more valuable skill for your use case?"Question guidelines:
The plan is ready when every task entry has:
Present the complete plan to the user and ask for approval. The user may:
Only after the user approves the plan. For each task, generate the complete file structure.
Use the project name as prefix: {project_name}001, {project_name}002, etc.
With project copy (default):
{output_dir}/
├── task-plan.md # The plan (already exists from planning)
├── projects/
│ └── {project_name}/ # Full copy of the dbt project
└── tasks/
└── {task_id}/
├── task.yaml
├── setup.sh
├── solution.sh
├── setup/
│ └── changes.patch
├── solutions/
│ └── changes.patch
├── tests/ # Custom assertion tests written during planning
│ # (AUTO_*.sql added later by the harness)
└── seeds/ # Populated after seed generation
With --no-copy (reference in-place):
{output_dir}/
├── task-plan.md
└── tasks/
└── {task_id}/
└── (same task files, but task.yaml points to the original project location)
The project_dir field depends on whether --no-copy was used:
{output_dir}/projects (where the project was copied)--no-copy: points to the parent directory of the original project (e.g., if the project is at /path/to/my-project, project_dir is /path/to)task_id: {task_id}
status: ready
description: "{one-line description of what the bug is — this is for task authors, not agents}"
prompts:
- key: base
prompt: |-
{the symptom prompt from the plan — what the agent sees.
Must sound like a real user report. Must NOT reveal the bug.}
author_name: ade-bench-plugin
author_email: ""
difficulty: {easy|medium|hard}
tags:
- dbt
- {relevant tags}
- {project_name}
test_setup: |-
dbt run --select {models to rebuild from the plan}
solution_seeds:
- table_name: {solution seed table from the plan}
variants:
- db_type: {duckdb|snowflake}
db_name: {db_name}
project_type: dbt
project_name: {project_name}
project_dir: {see above}
# Add `db_dir: {absolute_path_to_db_directory}` ONLY for DuckDB and ONLY if the user
# supplied an explicit DuckDB directory. If omitted, the harness falls back to
# `~/.ade-bench/shared/databases/duckdb/{db_name}.duckdb` — which is what you want
# for any of the bundled ade-bench projects (airbnb, f1, quickbooks, etc.).
# Never set `db_dir` for Snowflake.
A unified diff that introduces the bug into the working project. Generate from the specific mutation identified in the plan.
Before writing the patch, read the target model file and verify:
patch -p1 from the project rootFormat:
--- a/{relative_path_to_model}
+++ b/{relative_path_to_model}
@@ -{line},{count} +{line},{count} @@
{3 lines of context}
-{original line (correct)}
+{mutated line (buggy)}
{3 lines of context}
The exact reverse of the setup patch — fixes the bug. Always name it changes.patch.
Note: ade-bench reserves
changes.<db>.patch(e.g.,changes.snowflake.patch) for an additive delta used in cross-db tasks that target both DuckDB and Snowflake from a single source project. Don't emit that — this plugin generates single-variant tasks.
#!/bin/bash
patch -p1 < /app/setup/changes.patch
dbt deps
dbt run
#!/bin/bash
patch -p1 < /sage/solutions/changes.patch
If the task plan entry has a custom assertion test, write the SQL to {task_dir}/tests/<filename>.sql. Filenames must NOT start with AUTO_ (that prefix is reserved by the harness for auto-generated equality/existence tests). Use a stable convention like {task_id}_<purpose>.sql (e.g., mattermost001_refs.sql, mattermost001_materialization.sql).
Use the templates in knowledge/custom-tests.md and substitute the user's real model/project/column names. Include -- depends_on: {{ ref('<model>') }} at the top so dbt builds the model before running the test.
If a task has no custom assertion tests (value bug, equality alone is sufficient), leave {task_dir}/tests/ empty — the harness will populate AUTO_*.sql files at first run.
If --no-copy was NOT used (default):
Copy the user's dbt project into {output_dir}/projects/{project_name}/:
mkdir -p {output_dir}/projects
cp -r {project_path} {output_dir}/projects/{project_name}
This makes the output self-contained — the tasks, project, and plan are all in one directory that can be moved, shared, or committed to version control.
If --no-copy was used:
Do not copy the project. The task.yaml project_dir will point to the original project's parent directory. Warn the user:
"Tasks reference your project in-place at
{project_path}. The output directory is not self-contained — if the project moves, the tasks will break."
This step is mandatory — run it for every generated task before presenting results to the user. It catches malformed patches without requiring Docker or ADE-Bench.
Determine the project location:
{output_dir}/projects/{project_name}--no-copy: the original project pathFor each task, using the project at that location:
Create a temp working copy of the project:
cp -r {output_dir}/projects/{project_name} /tmp/ade-bench-validate-{task_id}
Apply the setup patch (introduces the bug):
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {output_dir}/tasks/{task_id}/setup/changes.patch
If this fails, the patch has wrong context lines or paths. Fix the patch and retry.
Verify the mutation happened — read the patched file and confirm the buggy line is present.
Apply the solution patch (fixes the bug):
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {output_dir}/tasks/{task_id}/solutions/changes.patch
If this fails, the solution patch doesn't match the post-setup state. Fix and retry.
Verify the file is restored — diff the patched file against the original. They must be identical:
diff {output_dir}/projects/{project_name}/{model_path} /tmp/ade-bench-validate-{task_id}/{model_path}
If there's a diff, the setup and solution patches are not exact inverses. Fix and retry.
Clean up:
rm -rf /tmp/ade-bench-validate-{task_id}
If any task fails validation, fix the patches before proceeding. Do not present broken tasks to the user. Common fixes:
@@ hunk header is off, recalculate from the actual fileReport results to the user:
Patch validation:
mattermost001: setup ✓ solution ✓ roundtrip ✓
mattermost002: setup ✓ solution ✓ roundtrip ✓
mattermost003: setup ✓ solution ✓ roundtrip ✓
After patch validation passes, ask the user:
"All patches validated. Want me to run the full ADE-Bench harness to verify the tasks end-to-end? This requires Docker running and ADE-Bench installed. I'll run the sage agent to confirm each task's setup/solution cycle works and generate seed CSVs."
If the user declines, skip to post-generation instructions. If they accept:
DuckDB tasks:
ade run {task_id} --db duckdb --project-type dbt --agent sage --seed \
--tasks-dir {output_dir}/tasks
Snowflake tasks:
ade run {task_id} --db snowflake --project-type dbt --agent sage \
--tasks-dir {output_dir}/tasks
Note: --seed CSV export is only supported on DuckDB. For Snowflake, run without --seed.
Check the harness output for each task:
dbt run failed in the broken state. This is expected for some bug types (e.g., TRY_TO_* removal causes compilation errors). Verify the error is the intended symptom.Report the failure to the user with the error details, then ask:
"Task {task_id} failed harness validation: {error summary}. Want me to diagnose and fix it, or skip this task?"
If fixing: investigate the error, update the patches or task.yaml as needed, and re-run. This may require going back into the planning loop to adjust the bug or switch to a different model.
Report final results:
Harness validation:
mattermost001: PASS (seeds generated)
mattermost002: PASS (seeds generated)
mattermost003: FAIL — solution didn't compile (investigating...)
If seeds were generated (DuckDB with --seed), run once more without --seed to confirm tests pass against the generated seeds:
ade run {task_id} --db duckdb --project-type dbt --agent sage \
--tasks-dir {output_dir}/tasks
After all validation is complete, tell the user:
{output_dir}/tasks/, {output_dir}/projects/, {output_dir}/task-plan.mdDuckDB:
ade run {task_id} --db duckdb --project-type dbt --agent claude --model claude-sonnet-4-5 \
--tasks-dir {output_dir}/tasks
Snowflake:
# Requires: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD, SNOWFLAKE_ROLE, SNOWFLAKE_WAREHOUSE
ade run {task_id} --db snowflake --project-type dbt --agent claude --model claude-sonnet-4-5 \
--tasks-dir {output_dir}/tasks
ade run --dataset-config {output_dir}/tasks --db {db_type} --project-type dbt --agent claude \
--tasks-dir {output_dir}/tasks
npx claudepluginhub typedef-ai/ade-bench-plugin --plugin ade-benchSearches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.