Skill

plan-tasks

Interactively plan and generate ADE-Bench benchmark tasks from a dbt project. Explores the codebase with the user in a pair-planning loop — understanding the domain, reasoning about what makes good benchmarks, and building up a task plan incrementally. Use when you want high-quality, tailored benchmark tasks for your own dbt project.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ade-bench:plan-tasks <path-to-dbt-project> [custom instructions]

User invocable

Model invocable

Inline context

Default effort

Argument hint<path-to-dbt-project> [custom instructions]

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are pair-planning with the user to create benchmark tasks for ADE-Bench, a framework that evaluates AI agents on dbt and SQL tasks. Your job is to deeply understand the user's dbt project, reason about what bugs would make meaningful benchmarks, and build a task plan collaboratively.

SKILL.md

584 lines · ~7.2k tokens(exceeds 5k compaction limit)

Stats

LanguageShell

Stars0

MaintenanceExcellent

Last CommitMay 8, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

ADE-Bench Task Planner

How ADE-Bench Tasks Work

A working dbt project has a bug introduced via a patch (setup)
An agent receives a prompt describing the symptom — never the bug itself
The agent investigates and attempts to fix it
dbt tests and table comparison against seed CSVs determine pass/fail

You will plan tasks following this pattern, then generate the full scaffolding.

Step 1: Parse Arguments

The user invokes this skill with a path and optional free-form instructions.

/ade-bench:plan-tasks /path/to/project [custom instructions]

Parse:

First argument: Path to the dbt project directory (must contain dbt_project.yml).
--no-copy (optional flag): If present, the generated tasks will reference the original project in-place instead of copying it into the output directory. Use this for very large projects to avoid duplication. Default behavior is to copy.
Everything else: Custom instructions that shape the entire planning process. Examples:
- "focus on the revenue pipeline, I want to test if agents understand our ARR calculation"
- "3 hard tasks that test Snowflake-specific SQL"
- "easy tasks only, targeting the staging layer"
- (empty — no custom instructions)

Store the custom instructions and flags; they inform every phase.

Step 1.5: Verify ade-bench is Installed

Before any project-specific work, check whether ade-bench (the harness that actually runs the generated tasks) is available. The plugin can generate tasks without it, but the user can't run them without it.

Detection — in order:

command -v ade — fastest check. If it returns a path and ade --help succeeds, ade-bench is installed.
test -f ~/.ade-bench/pyproject.toml && grep -q 'name = "ade-bench"' ~/.ade-bench/pyproject.toml — fallback for cases where the user installed ade-bench but ~/.local/bin isn't on PATH yet.

If detected: continue to Step 2.

If missing: tell the user

"ade-bench (the harness that runs the tasks I'm about to generate) isn't installed. I can install it now at ~/.ade-bench (default), or you can pick a different location, or skip this and install it later — generation works either way, but you'll need ade-bench before you can run the tasks."

If the user wants to install: follow the steps in skills/setup/SKILL.md (the same skill the user can also invoke directly as /ade-bench:setup). Pass through the Snowflake context if you've already auto-detected a Snowflake project, so the setup skill knows whether to copy .env.example.

If the user skips: continue, but at the end of Step 8 (or Post-Generation Summary) include a reminder that ade-bench still needs to be installed before tasks can run.

Either way, proceed to Step 2 once the user has answered.

Step 2: Read Background Knowledge

Read all .md files in the knowledge/ directory relative to the plugin root (the directory containing .claude-plugin/). If you don't know the plugin root path, use Glob to find knowledge/base.md and derive it from there. Skip files named _template.md.

These files describe categories of bugs commonly found in dbt/SQL projects — what they are, when they're impactful, how difficulty varies by context, and how common they are in the real world. Use them as inspiration and reasoning material throughout the planning process. You are not constrained to these patterns — if you discover a bug opportunity specific to the user's project that doesn't match any archetype, pursue it.

knowledge/custom-tests.md is a special case — it describes when a bug needs a custom assertion test alongside the AUTO equality test, and gives templates for the common assertion archetypes (refs, sources, materialization, column existence, model existence). Read it once and refer back to it when deciding what tests to ship with each task (see Step 5 / "Decide on custom assertion tests").

Step 3: Initial Project Scan

Before engaging the user, do a quick scan to have informed questions. Read:

dbt_project.yml — project name, profile, materialization defaults
profiles.yml — database type (DuckDB or Snowflake), connection details
Model directory structure — ls the models/ tree to understand layers (staging, intermediate, marts, etc.)
packages.yml — dbt packages in use
Count of models per layer

Do NOT read every model file yet. The goal is to understand the project shape, not to scan exhaustively. You'll go deep in specific areas during the planning loop.

Step 3.5: Survey Existing Tasks for This Project

Generated tasks should add coverage, not duplicate bugs the user (or ade-bench's official task collection) has already exercised. Before planning, read what already exists so you can aim for novelty.

Where to look (skip a location if it doesn't exist):

ade-bench's official tasks/ dir — at <ade-bench-path>/tasks/. Use the path discovered in Step 1.5. If ade-bench isn't installed, skip this location.
The plugin's prior output for this project — if {output_dir}/tasks/ already exists from a previous run, read it too.

What to read. Glob for task.yaml files in each location. For each one, parse the variants block — if any variant.project_name matches the current dbt project name, this task targets the same project. Read it.

For each matching task, capture compactly:

task_id
description (one-line)
prompts[0].prompt (the agent-facing prompt)
tags
Setup-patch summary — read setup/changes.patch, note the file(s) changed and the type of mutation (e.g., "constructor_standings.sql L23: MAX → SUM"). Don't quote the entire diff; one line per hunk is enough.

If a task uses an external project copy (its project_dir doesn't match the directories you found), still include it — the project_name match is the canonical signal.

Compile a brief like:

Existing tasks for project {project_name} ({N} found):
- f1001 (medium): "Add a source model layer." | adds src_*.sql files, updates stg_* refs
- f1005 (medium): "Constructor points look too high." | int_constructor_standings.sql L18: MAX(points) → SUM(points)
- ...

Hold this brief in working memory for the entire planning loop.

If zero matching tasks were found, note "no existing tasks for {project_name}" and proceed normally — every task you generate is novel by definition.

Step 4: Intake Questions

Ask the user a short set of questions to scope the work. Ask them together in one turn. If the custom instructions already answer some of these, skip those and pre-fill from the instructions.

How many tasks do you want? (suggest a default based on project size — e.g., 3-5 for small projects, 5-10 for large ones)
What difficulty mix? (e.g., "all hard", "mix of easy/medium/hard", "2 easy, 2 medium, 1 hard")
What area of the project should I focus on? (mention the layers/domains you found in the scan — e.g., "I see models for payments, users, telemetry, and reporting — any priority?")
Any models or patterns to specifically include or exclude?
Confirm database type (auto-detected — "I see this project uses Snowflake. Should the tasks target Snowflake-specific SQL, or stick to portable ANSI SQL?")

Also confirm the output directory. Default: ./ade-bench-tasks relative to the dbt project directory.

Wait for the user's answers before proceeding.

Step 5: The Planning Loop

Now enter the iterative planning loop. You are pair-planning with the user — explore the code, build up the plan, and ask questions when you hit ambiguities.

Create the Plan File

Create {output_dir}/task-plan.md with an initial skeleton:

# Task Plan: {project_name}

## Configuration
- Project: {project_name}
- Database: {db_type}
- Tasks requested: {count}
- Difficulty mix: {mix}
- Focus area: {area}
- Custom instructions: {instructions or "none"}

## Exploration Notes
(findings will be captured here as planning progresses)

## Proposed Tasks
(tasks will be added here as candidates are identified)

The Loop

Repeat this cycle until the plan is complete:

Explore

Go deep on the focus area the user specified. Read the actual SQL of candidate models — understand what they do, not just what tokens they contain.

For large projects (>30 models in the focus area), spawn Explore sub-agents in parallel using the Agent tool:

Give each agent a specific area to investigate (e.g., "Read all models in models/intermediate/finance/ and report what each one does, what SQL patterns it uses, and what its upstream/downstream dependencies are")
Use 2-3 agents maximum, each with a clear non-overlapping scope
In the agent prompt, include the relevant knowledge base content so the agent can flag bug opportunities

For small projects or narrow focus areas, read the models directly — no agents needed.

When reading a model, think about:

What is this model's business purpose? (not just "it joins two tables" but "it calculates ARR by pro-rating subscription amounts across billing periods")
What would break meaningfully? (a bug that changes output in a testable, verifiable way)
What would test a real debugging skill? (tracing through the DAG, understanding domain semantics, knowing Snowflake functions)
Is this bug realistic? (could a real developer plausibly introduce this?)
Is this bug novel vs the existing tasks from Step 3.5? Same model + same mutation type = duplicate. Same model + meaningfully different mutation (different SQL token, different debugging texture) = OK. Different model = always OK. If your candidate overlaps a known task, either pick a different mutation type, target a different model, or drop the candidate. Prefer mutation types and models the existing tasks haven't touched.

Update the Plan

After each exploration round, immediately update {output_dir}/task-plan.md. Capture:

Exploration notes: what you learned about the models, the data flow, the business logic
Task candidates: as you identify them, add entries under "Proposed Tasks"

A task entry evolves through stages:

Early (exploring):

### Task 1: (exploring int_account_arr_seats)
- Investigating the ARR pro-ration logic, looks promising for a date function bug

Developing:

### Task 1: mattermost001 — Hard (draft)
- **Model**: int_account_arr_seats
- **Bug idea**: Change DATEADD('day', ...) to DATEADD('month', ...) in the pro-ration
- **Why it's good**: Tests Snowflake date fluency + domain understanding of billing periods
- **Open question**: Need to confirm this model has downstream consumers that would surface the symptom

Ready:

### Task 1: mattermost001 — Hard
- **Model**: int_account_arr_seats (models/intermediate/finance/int_account_arr_seats.sql)
- **Bug**: Change DATEADD('day', ...) to DATEADD('month', ...) on line 47
- **Prompt**: "Finance flagged that ARR calculations for mid-month subscription changes are wildly off. The numbers are way too high for accounts that started mid-billing-cycle."
- **Difficulty**: Hard — requires understanding the pro-ration business logic to know which date unit is correct
- **Rationale**: Tests Snowflake DATEADD fluency + domain understanding of ARR pro-ration. The agent must read the business logic, not just grep for syntax errors.
- **Models to rebuild**: int_account_arr_seats, mart_arr
- **Solution seed table**: int_account_arr_seats
- **Custom assertion tests**: none (value bug — equality catches it)
- **Novelty**: distinct from f1005/f1006 (those swap MAX↔SUM on cumulative points; this swaps date units in pro-ration logic — different mutation type, different debugging skill)
- **Status**: ready

Decide on custom assertion tests

For each task, decide whether the bug needs a custom assertion test in addition to the AUTO equality test the harness generates from solution_seeds. Decide based on the bug type, not by asking the user. The decision rule, from knowledge/custom-tests.md:

Value bug (wrong join, wrong filter, wrong aggregation, broken CASE, wrong COALESCE default, hardcoded value, missing DISTINCT, wrong date function, etc.) → no custom test. Equality catches it.
Wrong ref() target → emit a refs assertion (Archetype 1). Defensive: if the alternate ref happens to return identical data, equality won't catch the bug at all.
Materialization change (incremental → table, table → view, removed config) → emit a materialization assertion (Archetype 3). Equality usually passes — output rows are the same — so the assertion is the test.
Dropped or renamed column → emit a column-existence assertion (Archetype 4). Equality also catches this, but the named assertion produces a clearer failure message.
Source change → emit a sources assertion (Archetype 2), same logic as wrong ref.
Removed dbt test or constraint → output unchanged, equality cannot catch this. Either skip the bug or write a manifest-walking assertion (Tier 2 — harder, see archetype guidance).

When a custom test is needed, write it directly using the templates in knowledge/custom-tests.md. Use real names from the user's project — substitute MODEL_NAME, PROJECT_NAME, REF_1, etc. in the templates with the actual values. Record the test SQL (or a summary of it) in the task plan entry under Custom assertion tests so the user sees it during approval.

Ask the User

When you hit something you can't resolve from code alone, ask. Good questions:

"The int_opportunity_ext model has 23 JOINs. Is this realistic for your team or is it legacy code nobody touches?" (determines if it's a good benchmark target)
"I found two models that compute ARR differently. Which one is the source of truth?"
"For task 3, I could target the IFF() logic in stg_subscriptions (Snowflake-specific) or the LEFT JOIN in int_known_licenses (ANSI). Which tests a more valuable skill for your use case?"
"The staging models are pretty simple (just renames and casts). Should I skip those and focus on intermediate/mart models where the business logic lives?"

Question guidelines:

Never ask what you could find out by reading the code
Batch related questions together
Focus on: requirements, preferences, tradeoffs, domain knowledge
Scale depth to the task — a focused brief needs fewer rounds than a vague one

Convergence

The plan is ready when every task entry has:

Model (with file path)
Bug (specific mutation with line reference)
Prompt (symptom description, no diagnosis)
Difficulty (with reasoning)
Rationale (what skill does this test)
Models to rebuild (for test_setup)
Solution seed table
Custom assertion tests (or "none" with the rationale)
Novelty (closest existing task and why this one is distinct, or "no existing tasks for this project")
Status: ready

Present the complete plan to the user and ask for approval. The user may:

Approve: proceed to generation
Request changes: "make task 2 easier", "swap task 4 for something in the telemetry layer" → back into the loop
Add tasks: "add one more targeting incremental models" → back into the loop

Step 6: Generate Task Files

Only after the user approves the plan. For each task, generate the complete file structure.

Task ID Convention

Use the project name as prefix: {project_name}001, {project_name}002, etc.

Output Structure

With project copy (default):

{output_dir}/
├── task-plan.md                     # The plan (already exists from planning)
├── projects/
│   └── {project_name}/              # Full copy of the dbt project
└── tasks/
    └── {task_id}/
        ├── task.yaml
        ├── setup.sh
        ├── solution.sh
        ├── setup/
        │   └── changes.patch
        ├── solutions/
        │   └── changes.patch
        ├── tests/                   # Custom assertion tests written during planning
        │                            # (AUTO_*.sql added later by the harness)
        └── seeds/                   # Populated after seed generation

With --no-copy (reference in-place):

{output_dir}/
├── task-plan.md
└── tasks/
    └── {task_id}/
        └── (same task files, but task.yaml points to the original project location)

task.yaml

The project_dir field depends on whether --no-copy was used:

Default (copy): points to {output_dir}/projects (where the project was copied)
--no-copy: points to the parent directory of the original project (e.g., if the project is at /path/to/my-project, project_dir is /path/to)

task_id: {task_id}
status: ready
description: "{one-line description of what the bug is — this is for task authors, not agents}"
prompts:
  - key: base
    prompt: |-
      {the symptom prompt from the plan — what the agent sees.
       Must sound like a real user report. Must NOT reveal the bug.}
author_name: ade-bench-plugin
author_email: ""
difficulty: {easy|medium|hard}
tags:
  - dbt
  - {relevant tags}
  - {project_name}

test_setup: |-
  dbt run --select {models to rebuild from the plan}

solution_seeds:
  - table_name: {solution seed table from the plan}

variants:
- db_type: {duckdb|snowflake}
  db_name: {db_name}
  project_type: dbt
  project_name: {project_name}
  project_dir: {see above}
  # Add `db_dir: {absolute_path_to_db_directory}` ONLY for DuckDB and ONLY if the user
  # supplied an explicit DuckDB directory. If omitted, the harness falls back to
  # `~/.ade-bench/shared/databases/duckdb/{db_name}.duckdb` — which is what you want
  # for any of the bundled ade-bench projects (airbnb, f1, quickbooks, etc.).
  # Never set `db_dir` for Snowflake.

setup/changes.patch

A unified diff that introduces the bug into the working project. Generate from the specific mutation identified in the plan.

Before writing the patch, read the target model file and verify:

The context lines match the actual file content exactly
The line numbers are correct
The patch applies cleanly with patch -p1 from the project root

Format:

--- a/{relative_path_to_model}
+++ b/{relative_path_to_model}
@@ -{line},{count} +{line},{count} @@
 {3 lines of context}
-{original line (correct)}
+{mutated line (buggy)}
 {3 lines of context}

solutions/changes.patch

The exact reverse of the setup patch — fixes the bug. Always name it changes.patch.

Note: ade-bench reserves changes.<db>.patch (e.g., changes.snowflake.patch) for an additive delta used in cross-db tasks that target both DuckDB and Snowflake from a single source project. Don't emit that — this plugin generates single-variant tasks.

setup.sh

#!/bin/bash
patch -p1 < /app/setup/changes.patch

dbt deps
dbt run

solution.sh

#!/bin/bash
patch -p1 < /sage/solutions/changes.patch

tests/.sql (one per "Custom assertion test" in the task plan)

If the task plan entry has a custom assertion test, write the SQL to {task_dir}/tests/<filename>.sql. Filenames must NOT start with AUTO_ (that prefix is reserved by the harness for auto-generated equality/existence tests). Use a stable convention like {task_id}_<purpose>.sql (e.g., mattermost001_refs.sql, mattermost001_materialization.sql).

Use the templates in knowledge/custom-tests.md and substitute the user's real model/project/column names. Include -- depends_on: {{ ref('<model>') }} at the top so dbt builds the model before running the test.

If a task has no custom assertion tests (value bug, equality alone is sufficient), leave {task_dir}/tests/ empty — the harness will populate AUTO_*.sql files at first run.

Copy the Project

If --no-copy was NOT used (default):

Copy the user's dbt project into {output_dir}/projects/{project_name}/:

mkdir -p {output_dir}/projects
cp -r {project_path} {output_dir}/projects/{project_name}

This makes the output self-contained — the tasks, project, and plan are all in one directory that can be moved, shared, or committed to version control.

If --no-copy was used:

Do not copy the project. The task.yaml project_dir will point to the original project's parent directory. Warn the user:

"Tasks reference your project in-place at {project_path}. The output directory is not self-contained — if the project moves, the tasks will break."

Step 7: Validate Patches (mandatory)

This step is mandatory — run it for every generated task before presenting results to the user. It catches malformed patches without requiring Docker or ADE-Bench.

Determine the project location:

Default (copy): {output_dir}/projects/{project_name}
--no-copy: the original project path

For each task, using the project at that location:

Create a temp working copy of the project:

cp -r {output_dir}/projects/{project_name} /tmp/ade-bench-validate-{task_id}

Apply the setup patch (introduces the bug):
```
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {output_dir}/tasks/{task_id}/setup/changes.patch
```
If this fails, the patch has wrong context lines or paths. Fix the patch and retry.
Verify the mutation happened — read the patched file and confirm the buggy line is present.
Apply the solution patch (fixes the bug):
```
cd /tmp/ade-bench-validate-{task_id} && patch -p1 < {output_dir}/tasks/{task_id}/solutions/changes.patch
```
If this fails, the solution patch doesn't match the post-setup state. Fix and retry.
Verify the file is restored — diff the patched file against the original. They must be identical:
```
diff {output_dir}/projects/{project_name}/{model_path} /tmp/ade-bench-validate-{task_id}/{model_path}
```
If there's a diff, the setup and solution patches are not exact inverses. Fix and retry.

Clean up:

rm -rf /tmp/ade-bench-validate-{task_id}

If any task fails validation, fix the patches before proceeding. Do not present broken tasks to the user. Common fixes:

Wrong context lines → re-read the model file and regenerate the patch
Wrong line numbers → the @@ hunk header is off, recalculate from the actual file
Patches not inverse → one of the two patches has an extra or missing line

Report results to the user:

Patch validation:
  mattermost001: setup ✓ solution ✓ roundtrip ✓
  mattermost002: setup ✓ solution ✓ roundtrip ✓
  mattermost003: setup ✓ solution ✓ roundtrip ✓

Step 8: Harness Validation (optional, recommended)

After patch validation passes, ask the user:

"All patches validated. Want me to run the full ADE-Bench harness to verify the tasks end-to-end? This requires Docker running and ADE-Bench installed. I'll run the sage agent to confirm each task's setup/solution cycle works and generate seed CSVs."

If the user declines, skip to post-generation instructions. If they accept:

For each task, run the harness

DuckDB tasks:

ade run {task_id} --db duckdb --project-type dbt --agent sage --seed \
  --tasks-dir {output_dir}/tasks

Snowflake tasks:

ade run {task_id} --db snowflake --project-type dbt --agent sage \
  --tasks-dir {output_dir}/tasks

Note: --seed CSV export is only supported on DuckDB. For Snowflake, run without --seed.

Interpret results

Check the harness output for each task:

PASS: The sage agent applied the solution, dbt ran successfully, and tests passed. The task is valid.
FAIL — setup script error: The setup patch applied but dbt run failed in the broken state. This is expected for some bug types (e.g., TRY_TO_* removal causes compilation errors). Verify the error is the intended symptom.
FAIL — solution script error: The solution patch didn't fix the problem. The patch or the solution is wrong. Go back into the planning loop to fix.
FAIL — test failure: The sage agent's output didn't match the expected seeds (AUTO equality test) or a custom assertion test returned rows. For equality failures: the seed CSVs may need to be regenerated, or the solution doesn't fully fix the bug. For custom-assertion failures: re-read the assertion against the post-solution graph state — likely the assertion was written against the wrong placeholder values.

If a task fails harness validation

Report the failure to the user with the error details, then ask:

"Task {task_id} failed harness validation: {error summary}. Want me to diagnose and fix it, or skip this task?"

If fixing: investigate the error, update the patches or task.yaml as needed, and re-run. This may require going back into the planning loop to adjust the bug or switch to a different model.

Report final results:

Harness validation:
  mattermost001: PASS (seeds generated)
  mattermost002: PASS (seeds generated)
  mattermost003: FAIL — solution didn't compile (investigating...)

After harness validation

If seeds were generated (DuckDB with --seed), run once more without --seed to confirm tests pass against the generated seeds:

ade run {task_id} --db duckdb --project-type dbt --agent sage \
  --tasks-dir {output_dir}/tasks

Post-Generation Summary

After all validation is complete, tell the user:

What was generated: list of task IDs with difficulty and target model
Validation status: patch verification and harness results (if run)
Where the files are: {output_dir}/tasks/, {output_dir}/projects/, {output_dir}/task-plan.md
How to run with a real agent:

DuckDB:

ade run {task_id} --db duckdb --project-type dbt --agent claude --model claude-sonnet-4-5 \
  --tasks-dir {output_dir}/tasks

Snowflake:

# Requires: SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, SNOWFLAKE_PASSWORD, SNOWFLAKE_ROLE, SNOWFLAKE_WAREHOUSE
ade run {task_id} --db snowflake --project-type dbt --agent claude --model claude-sonnet-4-5 \
  --tasks-dir {output_dir}/tasks

How to run all tasks at once:

ade run --dataset-config {output_dir}/tasks --db {db_type} --project-type dbt --agent claude \
  --tasks-dir {output_dir}/tasks

Important Guidelines

You are pair-planning, not executing a pipeline. Adapt to what you learn. If the first model you explore isn't a good candidate, move on. If the user redirects you, follow.
Understand before proposing. Read the model SQL and understand the business logic before suggesting a bug. "This model has a LEFT JOIN" is not understanding. "This model joins subscriptions to accounts with a LEFT because some accounts have no active subscription, and the downstream ARR mart counts all accounts" is understanding.
Never reveal the bug in the agent prompt. The prompt describes a symptom as a real user would report it. "Revenue numbers look wrong" — not "you used SUM instead of MAX."
One bug per task. Keep tasks focused and unambiguous.
Prefer minimal patches. A 1-2 line diff is ideal. The complexity should be in the debugging, not in the fix.
Bugs must be verifiable. The mutation must cause either (a) a measurable difference in the output table that the AUTO equality test catches against a seed CSV, or (b) a structural change a custom assertion test catches. Bugs that produce neither — e.g., removing a dbt test that no longer affects anything observable — aren't valid benchmark targets.
Don't modify the user's original project. All mutations happen via patches applied inside Docker containers at runtime.
Supported databases: DuckDB and Snowflake. Don't generate variants for other databases.
Update the plan file after every exploration round. Don't wait until the end.
End each turn by either asking the user a question or presenting the plan for approval. Don't stop mid-exploration with no interaction.
Don't duplicate existing tasks. Step 3.5 builds the dedupe brief. Same {model, mutation type} as an existing task = duplicate. The point of generating user-tailored tasks is added coverage over what already exists; redundant tasks dilute the benchmark and waste the user's review time.

plan-tasks

Invocation

Context Preview

SKILL.md

plan-tasks

Invocation

Context Preview

SKILL.md

ADE-Bench Task Planner

How ADE-Bench Tasks Work

Step 1: Parse Arguments

Step 1.5: Verify ade-bench is Installed

Step 2: Read Background Knowledge

Step 3: Initial Project Scan

Step 3.5: Survey Existing Tasks for This Project

Step 4: Intake Questions

Step 5: The Planning Loop

Create the Plan File

The Loop

Explore

Update the Plan

Decide on custom assertion tests

Ask the User

Convergence

Step 6: Generate Task Files

Task ID Convention

Output Structure

task.yaml

setup/changes.patch

solutions/changes.patch

setup.sh

solution.sh

tests/.sql (one per "Custom assertion test" in the task plan)

Copy the Project

Step 7: Validate Patches (mandatory)

Step 8: Harness Validation (optional, recommended)

For each task, run the harness

Interpret results

If a task fails harness validation

After harness validation

Post-Generation Summary

Important Guidelines

Similar Skills

ADE-Bench Task Planner

How ADE-Bench Tasks Work

Step 1: Parse Arguments

Step 1.5: Verify ade-bench is Installed

Step 2: Read Background Knowledge

Step 3: Initial Project Scan

Step 3.5: Survey Existing Tasks for This Project

Step 4: Intake Questions

Step 5: The Planning Loop

Create the Plan File

The Loop

Explore

Update the Plan

Decide on custom assertion tests

Ask the User

Convergence

Step 6: Generate Task Files

Task ID Convention

Output Structure

task.yaml

setup/changes.patch

solutions/changes.patch

setup.sh

solution.sh

tests/.sql (one per "Custom assertion test" in the task plan)

Copy the Project

Step 7: Validate Patches (mandatory)

Step 8: Harness Validation (optional, recommended)

For each task, run the harness

Interpret results

If a task fails harness validation

After harness validation

Post-Generation Summary

Important Guidelines

Similar Skills