From ade-bench
Scan a dbt project and generate ADE-Bench benchmark tasks. Analyzes models, proposes bug-injection scenarios by difficulty, and outputs complete task scaffolding (task.yaml, patches, setup/solution scripts, test SQL). Use when you want to benchmark AI agents against your own dbt project and data.
How this skill is triggered — by the user, by Claude, or both
Slash command
/ade-bench:create-task <path-to-dbt-project> [--output-dir <path>] [--db-name <name>] [--db-path <path>]<path-to-dbt-project> [--output-dir <path>] [--db-name <name>] [--db-path <path>]The summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are a benchmark task generator for ADE-Bench, a framework that evaluates AI agents on dbt and SQL tasks. Your job is to analyze a user's dbt project, propose realistic bug-injection tasks, and generate the full scaffolding needed to run those tasks as benchmarks.
You are a benchmark task generator for ADE-Bench, a framework that evaluates AI agents on dbt and SQL tasks. Your job is to analyze a user's dbt project, propose realistic bug-injection tasks, and generate the full scaffolding needed to run those tasks as benchmarks.
ADE-Bench tasks work by:
You will generate tasks that follow this pattern using the user's own dbt project and data.
The user invokes this skill with a path to their dbt project. Parse the arguments:
dbt_project.yml.--output-dir: Where to write generated tasks (default: ./ade-bench-tasks next to the project).--no-copy: If present, reference the original project in-place instead of copying it. Default: copy into {output_dir}/projects/.--db-type: Database type override (duckdb or snowflake). Default: auto-detected from profiles.yml.--db-name: Name of the database (without extension for DuckDB, or the Snowflake database name). Default: inferred from profiles.yml.--db-path: Path to the directory containing the DuckDB file (DuckDB only). When provided, this value is written to db_dir in task.yaml. When omitted, do NOT write db_dir at all — the harness falls back to ~/.ade-bench/shared/databases/duckdb/<db_name>.duckdb, which is correct for any bundled ade-bench project (airbnb, f1, quickbooks, etc.).Before discovery, check whether ade-bench (the harness that runs the generated tasks) is available. Generation works without it, but the user can't run the output without it.
Detection — in order:
command -v ade — if it returns a path and ade --help succeeds, ade-bench is installed.test -f ~/.ade-bench/pyproject.toml && grep -q 'name = "ade-bench"' ~/.ade-bench/pyproject.toml — fallback if ~/.local/bin isn't on PATH yet.If detected: continue.
If missing: tell the user
"ade-bench isn't installed. I can install it now at
~/.ade-bench(default), pick a different location, or skip and install it later — task generation works either way."
If the user wants to install: follow the steps in skills/setup/SKILL.md (also invocable as /ade-bench:setup). Pass the detected --db-type along (if already known from profiles.yml) so the setup skill knows whether to copy .env.example for Snowflake.
If the user skips: continue, but include a reminder in the final post-generation summary that ade-bench still needs to be installed before tasks can run.
Spawn the discover-project agent using the Agent tool with the following prompt:
Analyze the dbt project at {project_path}.
The pattern catalog YAML files are at: {absolute_path_to_plugin_root}/patterns/
Read all .yaml files there (except _template.yaml) to get the bug injection patterns.
Use them to match and rank task candidates as described in your instructions.
Return a full structured report following your output format. Be thorough — read every model file, build the complete dependency graph, and rank the top 10 task candidates by number of matched patterns.
Replace {absolute_path_to_plugin_root} with the actual path to the plugin directory (the directory containing .claude-plugin/). If you don't know the plugin root path, use Glob to find patterns/base.yaml and derive it from there.
The discover-project agent is read-only and will return a structured report containing:
Wait for the agent to complete, then use its report for Phase 2.
Supported databases:
SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD, SNOWFLAKE_ROLE, SNOWFLAKE_WAREHOUSE). ADE-Bench clones the source database per-task for isolation.If the detected database type is not DuckDB or Snowflake, inform the user that only these two are currently supported.
Present a summary to the user based on the discovery report:
Found dbt project: {project_name}
Database: {type} at {path_or_connection}
Models: {count} ({staging_count} staging, {mart_count} marts, etc.)
Sources: {count} source tables
Packages: {list}
Top task candidates: {count} models identified
To avoid generating bugs that duplicate what already exists, scan known task locations and read any existing tasks that target this same dbt project.
Where to look (skip a location if it doesn't exist):
tasks/ dir — at <ade-bench-path>/tasks/, where <ade-bench-path> is the path discovered earlier in this phase. Skip if ade-bench isn't installed.{output_dir}/tasks/ if that already exists from a previous run.What to read. Glob for task.yaml files. For each, parse variants — if any variant.project_name matches the current project, capture:
task_id, description, prompts[0].prompt, tagssetup/changes.patch per modified file: filename, line range, mutation type (e.g., int_constructor_standings.sql L18: MAX(points) → SUM(points))Build a brief and hold it in memory through Phase 2:
Existing tasks for project {project_name} ({N} found):
- f1005 (medium): "Constructor points look too high." | int_constructor_standings.sql L18: MAX → SUM
- f1006 (medium): "Both points columns are inflated." | int_*_standings.sql: MAX → SUM in two places
- ...
If zero matching tasks were found, note "no existing tasks for {project_name}" and skip novelty filtering in Phase 2.
Based on the project scan, propose 3-5 bug-injection tasks. Use the existing-task brief from Phase 1's "Survey Existing Tasks" step to avoid duplicates: filter the discovery agent's ranked candidates to remove any model × pattern combo that's already covered, and prefer models or pattern types the existing tasks haven't touched.
For each proposed task, explain:
assertion_tests, list which assertions will be generated. Lets the user see what's being verified beyond the AUTO equality test before committing.If the existing-task scan caused you to drop candidates, mention that to the user briefly: "Excluded N candidates that overlap existing tasks (e.g., MAX→SUM on constructor_standings is already covered by f1005)."
The pattern catalog lives in the patterns/ directory at the plugin root. Read all .yaml files in that directory using Glob and Read. Each file contains a list of bug injection patterns with this schema:
patterns:
- id: wrong_join_type # Unique identifier
name: Wrong join type # Human-readable name
difficulty: medium # easy | medium | hard
db_types: all # all | [duckdb] | [snowflake]
applies_when: ["LEFT JOIN"] # SQL tokens the model must contain
mutation: ... # What to change
symptom_template: ... # Agent prompt template with {placeholders}
guidance: ... # Tips for applying effectively
assertion_tests: # Optional structural assertions (see below)
- type: ...
args: { ... }
How to select patterns for a model:
patterns/.db_types — only include patterns that match the project's database type.applies_when — only include patterns whose required SQL tokens appear in the model (as reported by the discovery agent).Users can add their own patterns by creating new .yaml files in the patterns/ directory. See patterns/_template.yaml for the schema and an example. Custom patterns are picked up automatically — no changes to the skill or agent needed.
The plugin ships with:
patterns/base.yaml — ANSI SQL patterns (works on all databases)patterns/snowflake.yaml — Snowflake-specific patterns (IFF, QUALIFY, FLATTEN, etc.)Some patterns declare assertion_tests — structural assertions to emit alongside the AUTO equality test. These exist for bugs equality alone won't reliably catch (wrong ref where the alternate returns identical data, materialization changes, etc.).
When you generate a task from a pattern that has assertion_tests, you must emit one tests/<filename>.sql file per entry. The filename is anything not starting with AUTO_ (the harness preserves these and runs them as dbt tests; reserve AUTO_ for harness-generated tests).
Resolving placeholders. Each entry's args may contain placeholders the matcher fills in:
| Placeholder | Source |
|---|---|
{model} | Target model's unqualified name |
{project_name} | dbt project name (from dbt_project.yml) |
{column} | For column-related bugs, the affected column |
{original_refs} | List of ref() target names parsed from the model's pre-mutation SQL |
{original_sources} | List of [source_name, table_name] pairs parsed from pre-mutation SQL |
{original_materialization} | Resolved materialization before mutation — read from the model's {{ config(materialized=...) }}, then dbt_project.yml models: defaults, then dbt's default view |
Renderers. For each {type, args} entry, produce a .sql file from the templates below. Pick a sensible filename like <task_id>_<short-purpose>.sql (e.g., f1_plugin002_refs.sql).
refs_must_include — args: {model, expected} (list of names)
-- depends_on: {{ ref('<model>') }}
{% set node = graph.nodes.get("model.<project_name>.<model>") %}
{% set actual = node.refs | map(attribute='name') | list if node else [] %}
{% if node is none %}
select '<model> not found in graph' as error_message
{%- for r in <expected> %}
{% elif '<r>' not in actual %}
select '<model> does not ref <r>' as error_message
{%- endfor %}
{% endif %}
union all select 'ok' as error_message where false
Unroll the for-loop at generation time — emit one {% elif %} per expected ref. The skill is a code generator; render concrete SQL, not Jinja meta-templates.
refs_must_match_exactly — args: {model, expected} (list)
-- depends_on: {{ ref('<model>') }}
{% set node = graph.nodes.get("model.<project_name>.<model>") %}
{% if node is none %}
select '<model> not found in graph' as error_message
{% else %}
{% set actual = node.refs | map(attribute='name') | list | sort %}
{% set expected = [<comma-separated quoted names>] | sort %}
{% if actual != expected %}
select '<model> refs mismatch — actual: ' ~ actual ~ ', expected: ' ~ expected as error_message
{% else %}
select 'ok' as error_message where false
{% endif %}
{% endif %}
refs_must_not_include — args: {model, forbidden} (list)
-- depends_on: {{ ref('<model>') }}
{% set node = graph.nodes.get("model.<project_name>.<model>") %}
{% set actual = node.refs | map(attribute='name') | list if node else [] %}
{% if node is none %}
select '<model> not found in graph' as error_message
{%- for f in <forbidden> %}
{% elif '<f>' in actual %}
select '<model> must not ref <f>' as error_message
{%- endfor %}
{% endif %}
union all select 'ok' as error_message where false
sources_must_include — args: {model, expected} (list of [source, table] pairs)
-- depends_on: {{ ref('<model>') }}
{% set node = graph.nodes.get("model.<project_name>.<model>") %}
{% set actual = node.sources if node else [] %}
{% if node is none %}
select '<model> not found in graph' as error_message
{%- for src, tbl in <expected> %}
{% elif ['<src>', '<tbl>'] not in actual %}
select '<model> does not source <src>.<tbl>' as error_message
{%- endfor %}
{% endif %}
union all select 'ok' as error_message where false
materialization_must_be — args: {model, expected} (string: table/view/incremental/ephemeral)
-- depends_on: {{ ref('<model>') }}
{% set node = graph.nodes.get("model.<project_name>.<model>") %}
{% set actual = node.config.materialized if node else 'unknown' %}
{% if actual != '<expected>' %}
select '<model> is materialized as ' ~ actual ~ ', expected <expected>' as error_message
{% else %}
select 'ok' as error_message where false
{% endif %}
column_must_exist — args: {model, column}. Uses information_schema.columns, lowercased on both sides for DuckDB/Snowflake portability.
-- depends_on: {{ ref('<model>') }}
select 'column <column> missing from <model>' as error_message
where not exists (
select 1
from information_schema.columns
where lower(table_name) = lower('<model>')
and lower(column_name) = lower('<column>')
)
model_must_exist — args: {model}
{% if "model.<project_name>.<model>" not in graph.nodes %}
select '<model> does not exist in the graph' as error_message
{% else %}
select 'ok' as error_message where false
{% endif %}
Important constraints:
model_must_exist only).-- depends_on: {{ ref('<model>') }} so dbt runs the model before the test. Skip it for model_must_exist since the model may legitimately not exist.AUTO_ — that prefix is reserved by the harness for auto-generated tests.assertion_tests. Patterns whose mutation leaves output unchanged AND have no assertion are unverifiable — skip them.wrong_aggregation, lean toward wrong_join_type, wrong_ref, missing_filter, etc.Present proposals and ask the user to pick which ones to generate (or all of them).
For each selected task, generate the complete file structure. Use the project name to derive a task ID prefix (e.g., project acme_analytics → tasks acme_analytics001, acme_analytics002, etc.).
With project copy (default):
{output_dir}/
├── projects/
│ └── {project_name}/ # Full copy of the dbt project
└── tasks/
└── {task_id}/
├── task.yaml
├── setup.sh
├── solution.sh
├── setup/
│ └── changes.patch
├── solutions/
│ └── changes.patch
├── tests/ # Custom assertion tests from assertion_tests
│ # (AUTO_*.sql are added by harness at seed time)
└── seeds/ # Populated after seed generation (DuckDB only)
With --no-copy:
{output_dir}/
└── tasks/
└── {task_id}/
└── (same task files, but task.yaml points to the original project location)
If --no-copy was NOT used (default):
mkdir -p {output_dir}/projects
cp -r {project_path} {output_dir}/projects/{project_name}
If --no-copy was used: skip the copy. Warn the user that the output is not self-contained.
The project_dir field depends on whether --no-copy was used:
{output_dir}/projects--no-copy: parent directory of the original project pathGenerate the variant block based on the detected (or user-specified) database type.
For DuckDB projects:
task_id: {task_id}
status: ready
description: "{one-line description of what the bug is}"
prompts:
- key: base
prompt: |-
{the symptom prompt — what the agent sees. Do NOT reveal the bug.
Make it sound like a real user report, not a test question.
Be specific about which model/table and what's wrong,
but don't say how to fix it.}
author_name: ade-bench-plugin
author_email: ""
difficulty: {easy|medium|hard}
tags:
- dbt
- {bug-category} # e.g., debugging, aggregation, join-logic
- {project_name}
test_setup: |-
dbt run --select {space-separated list of models that need to be rebuilt for testing}
solution_seeds:
- table_name: {the model/table name that should be compared}
variants:
- db_type: duckdb
db_name: {db_name}
project_type: dbt
project_name: {project_name}
project_dir: {see above}
# Include the next line ONLY if the user passed --db-path. Otherwise omit it entirely
# so the harness uses ~/.ade-bench/shared/databases/duckdb/{db_name}.duckdb.
# db_dir: {absolute_path_to_db_directory}
For Snowflake projects:
task_id: {task_id}
status: ready
description: "{one-line description of what the bug is}"
prompts:
- key: base
prompt: |-
{the symptom prompt — what the agent sees. Do NOT reveal the bug.
Make it sound like a real user report, not a test question.
Be specific about which model/table and what's wrong,
but don't say how to fix it.}
author_name: ade-bench-plugin
author_email: ""
difficulty: {easy|medium|hard}
tags:
- dbt
- {bug-category} # e.g., debugging, aggregation, join-logic
- {project_name}
test_setup: |-
dbt run --select {space-separated list of models that need to be rebuilt for testing}
solution_seeds:
- table_name: {the model/table name that should be compared}
variants:
- db_type: snowflake
db_name: {db_name}
project_type: dbt
project_name: {project_name}
project_dir: {see above}
Note: Snowflake variants do NOT use db_dir — the database is accessed via credentials in the Snowflake account. The db_name should match the Snowflake source database name that ADE-Bench will clone for the task.
This is a unified diff that introduces the bug into the working project. Generate it as the output of diff -u original.sql broken.sql for each file modified.
Format:
--- a/{relative_path_to_model}
+++ b/{relative_path_to_model}
@@ -{line},{count} +{line},{count} @@
{context lines}
-{original line (correct)}
+{mutated line (buggy)}
{context lines}
IMPORTANT: The patch must apply cleanly with patch -p1 from the project root. Use the correct relative paths. Include 3 lines of context around each change.
This is the reverse of the setup patch — it fixes the bug. It's what the sage agent applies to produce the correct output. Always name it changes.patch regardless of the variant's database type. The plugin generates single-variant tasks, so there is only ever one solution patch per task.
Format: Same as setup patch but with the lines swapped (the "buggy" line becomes the line to remove, the "correct" line becomes the line to add). The setup and solution patches are exact inverses of each other.
Note: ade-bench's cross-db convention reserves
changes.<db>.patch(e.g.,changes.snowflake.patch) for an additive delta that stacks on top ofchanges.patch, applied via a separateif [[ "$*" == *"--db-type=snowflake"* ]]block. Don't emit those — this plugin only generates single-variant tasks.
#!/bin/bash
patch -p1 < /app/setup/changes.patch
# Run dbt to materialize the broken state
dbt deps
dbt run
#!/bin/bash
patch -p1 < /sage/solutions/changes.patch
assertion_tests entry)If the matched pattern declares assertion_tests, render each entry into a .sql file under tests/ using the templates from the Assertion Tests section above. Resolve placeholders ({model}, {project_name}, {column}, {original_refs}, {original_sources}, {original_materialization}) before writing.
Filenames must not start with AUTO_. Use a stable convention like <task_id>_<purpose>.sql (e.g., mattermost001_refs_match.sql, mattermost001_materialization.sql).
If a pattern has no assertion_tests field, skip — equality alone is sufficient for that bug.
Tell the user:
Verify database access:
.duckdb file is accessible. If db_dir was set in task.yaml, it must contain <db_name>.duckdb. If db_dir was omitted (the default for bundled ade-bench projects), the harness expects ~/.ade-bench/shared/databases/duckdb/<db_name>.duckdb to exist — that file is downloaded by /ade-bench:setup.export SNOWFLAKE_ACCOUNT=...
export SNOWFLAKE_USER=...
export SNOWFLAKE_PASSWORD=...
export SNOWFLAKE_ROLE=...
export SNOWFLAKE_WAREHOUSE=...
The source database in db_name must exist and be accessible with these credentials. ADE-Bench will clone it per-task.Review the generated tests/ directory for any task whose pattern had assertion_tests. The custom assertions should reference real model/column names from the user's project. Filenames must not start with AUTO_.
Generate seeds and tests by running:
DuckDB:
ade run {task_id} --db duckdb --project-type dbt --agent sage --seed \
--tasks-dir {output_dir}/tasks
Snowflake:
ade run {task_id} --db snowflake --project-type dbt --agent sage \
--tasks-dir {output_dir}/tasks
Note: --seed export is only supported on DuckDB. For Snowflake tasks, run without --seed and write test SQL manually, or use DuckDB for seed generation if a DuckDB extract is available.
Validate by running again without --seed:
ade run {task_id} --db {db_type} --project-type dbt --agent sage \
--tasks-dir {output_dir}/tasks
Test with a real agent:
ade run {task_id} --db {db_type} --project-type dbt --agent claude --model claude-sonnet-4-5 \
--tasks-dir {output_dir}/tasks
This step is mandatory — run it for every generated task before presenting results to the user.
For each task, using the project at the configured project_dir:
Create a temp working copy:
cp -r {project_dir}/{project_name} /tmp/ade-bench-validate-{task_id}
Apply the setup patch (introduces the bug):
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {task_dir}/setup/changes.patch
If this fails, the patch has wrong context lines or paths. Fix the patch and retry.
Verify the mutation happened — read the patched file and confirm the buggy line is present.
Apply the solution patch (fixes the bug):
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {task_dir}/solutions/changes.patch
If this fails, fix and retry.
Verify the file is restored — diff against the original:
diff {project_dir}/{project_name}/{model_path} /tmp/ade-bench-validate-{task_id}/{model_path}
If there's a diff, the patches are not exact inverses. Fix and retry.
Clean up:
rm -rf /tmp/ade-bench-validate-{task_id}
Verify model names in solution_seeds and test_setup match actual dbt model names in the project.
Report results:
Patch validation:
{task_id}: setup ✓ solution ✓ roundtrip ✓
If any task fails, fix the patches before presenting to the user. Do not present broken tasks.
After patch validation passes, ask the user:
"All patches validated. Want me to run the full ADE-Bench harness to verify the tasks end-to-end? This requires Docker running and ADE-Bench installed."
If the user accepts, for each task run:
DuckDB:
ade run {task_id} --db duckdb --project-type dbt --agent sage --seed \
--tasks-dir {output_dir}/tasks
Snowflake:
ade run {task_id} --db snowflake --project-type dbt --agent sage \
--tasks-dir {output_dir}/tasks
Check results:
{original_refs} parsed incorrectly), or the assertion was written against the pre-mutation graph rather than the restored one.If a task fails, report the error and ask whether to fix or skip it.
For DuckDB tasks with --seed, run once more without --seed to confirm tests pass against the generated seeds.
assertion_tests, emit them. Patterns ship them because equality alone is insufficient or ambiguous for that bug class. Skipping them leaves bugs that can pass the harness while still being broken.npx claudepluginhub typedef-ai/ade-bench-plugin --plugin ade-benchSearches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.