Skill

skill-forge-eval

Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.

Python

Markdown

testing

code-quality

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/skill-forge:skill-forge-eval

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run structured evaluations against Claude Code skills to verify triggering,

SKILL.md

233 lines · ~1.7k tokens

Stats

LanguagePython

Stars58

Forks28

MaintenanceExcellent

Last CommitApr 10, 2026

Actions

View Source View Plugin View on GitHub View README

Skill Evaluation Pipeline

Run structured evaluations against Claude Code skills to verify triggering, correctness, and quality using a multi-agent pipeline.

Process

Step 1: Define Eval Set

Accept eval definitions from:

Path to eval set JSON: evals/evals.json or user-specified file
Inline prompts: User provides eval queries directly
Auto-generated: Generate from skill description (see Step 1b)

Eval set JSON schema:

{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "evals": [
    {
      "eval_id": 0,
      "eval_name": "descriptive-name",
      "prompt": "The user's task prompt",
      "input_files": [],
      "assertions": [
        {
          "name": "output-has-score",
          "check": "Output contains a numeric score between 0-100",
          "weight": 1.0
        }
      ],
      "should_trigger": true
    }
  ]
}

Step 1b: Auto-Generate Eval Set

If no eval set exists, generate one:

Read the skill's SKILL.md description and instructions
Run python scripts/generate_eval_set.py <skill-path> to produce a starter set
Present the generated set to the user for review and editing
User approves or modifies before proceeding

Step 2: Set Up Workspace

Create the eval workspace outside the skill directory to avoid confusing eval artifacts with skill files. Use a sibling directory or a dedicated location:

eval-workspace/
  iteration-1/
    eval-0/
      eval_metadata.json        # Assertions and config for this eval
      with_skill/
        outputs/                # Skill execution outputs
        timing.json             # Token count + duration
        grading.json            # Assertion results + evidence
      baseline/
        outputs/
        timing.json
        grading.json
    eval-1/
      eval_metadata.json
      with_skill/
        outputs/
        timing.json
        grading.json
      baseline/
        outputs/
        timing.json
        grading.json
    benchmark.json              # Aggregated metrics
    benchmark.md                # Human-readable report

For each eval directory, create eval_metadata.json from the eval set entry:

{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [...],
  "should_trigger": true
}

Step 3: Execute Eval Runs

For each eval in the set, spawn two parallel runs:

With-skill run (delegate to agents/skill-forge-executor.md):

Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>

Baseline run (delegate to agents/skill-forge-executor.md):

For new skills: run without the skill loaded
For improved skills: run with the previous version (snapshot it first)

Save timing data to timing.json in each run directory:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

Step 4: Grade Results

Delegate to agents/skill-forge-grader.md for each completed run:

Grade against assertions defined in eval_metadata.json
Save results to grading.json per run:

{
  "eval_id": 0,
  "run_type": "with_skill",
  "assertions": [
    {
      "name": "output-has-score",
      "passed": true,
      "evidence": "Found score: 87/100 on line 14"
    }
  ],
  "pass_rate": 1.0
}

Step 5: Aggregate and Analyze

Run python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>
This produces benchmark.json and benchmark.md with:
- Pass rate per eval (with_skill vs baseline)
- Average time and token usage
- Improvement ratio (with_skill / baseline)
Delegate to agents/skill-forge-analyzer.md to:
- Surface patterns that aggregate stats might hide
- Identify consistently failing assertion types
- Flag regressions from previous iterations

Step 6: Present Results

Generate a summary report:

# Eval Report: [skill-name] — Iteration [N]

## Overall
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |

## Per-Eval Results
| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |

## Patterns & Insights
[From analyzer agent]

## Recommendations
[Specific improvements based on failures]

Step 7: Collect Feedback

Save user feedback to feedback.json:

{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "2026-03-06T12:00:00Z"
    }
  ],
  "status": "complete"
}

Pass feedback to /skill-forge evolve for the next iteration.

Advanced: Blind Comparison

For rigorous A/B testing between skill versions:

Delegate to agents/skill-forge-comparator.md
Pass two directories: eval-<ID>/with_skill/outputs/ and eval-<ID>/baseline/outputs/
Comparator assigns random labels (Version A / Version B) so it cannot know which is new
Rates each output on assertion criteria from eval_metadata.json
Returns preference scores without knowing which is "new" vs "old"

Error Handling

Executor timeout: If a run exceeds 5 minutes, terminate and mark as "timed_out": true in timing.json
Executor failure: If a run crashes, save the error to error.txt in the run directory and continue with remaining evals
Grading failure: If grading cannot determine pass/fail, mark assertion as "passed": null with evidence explaining why
Missing files: If timing.json or grading.json is missing after a run, flag the eval as incomplete in the report
Partial completion: Always aggregate and report whatever results are available — do not block on one failed eval

Quality Gates

Before marking an eval run as complete:

All evals executed (with_skill + baseline)
Timing data captured for every run
All assertions graded with evidence
Benchmark aggregated with pass rate, time, tokens
Analyzer patterns documented
Results presented to user

skill-forge-eval

Popularity

Invocation

Context Preview

SKILL.md

skill-forge-eval

Popularity

Invocation

Context Preview

SKILL.md

Skill Evaluation Pipeline

Process

Step 1: Define Eval Set

Step 1b: Auto-Generate Eval Set

Step 2: Set Up Workspace

Step 3: Execute Eval Runs

Step 4: Grade Results

Step 5: Aggregate and Analyze

Step 6: Present Results

Step 7: Collect Feedback

Advanced: Blind Comparison

Error Handling

Quality Gates

Similar Skills

Skill Evaluation Pipeline

Process

Step 1: Define Eval Set

Step 1b: Auto-Generate Eval Set

Step 2: Set Up Workspace

Step 3: Execute Eval Runs

Step 4: Grade Results

Step 5: Aggregate and Analyze

Step 6: Present Results

Step 7: Collect Feedback

Advanced: Blind Comparison

Error Handling

Quality Gates

Similar Skills