From skill-forge
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-forge:skill-forge-evalThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run structured evaluations against Claude Code skills to verify triggering,
Run structured evaluations against Claude Code skills to verify triggering, correctness, and quality using a multi-agent pipeline.
Accept eval definitions from:
evals/evals.json or user-specified fileEval set JSON schema:
{
"skill_name": "my-skill",
"skill_path": "./my-skill",
"evals": [
{
"eval_id": 0,
"eval_name": "descriptive-name",
"prompt": "The user's task prompt",
"input_files": [],
"assertions": [
{
"name": "output-has-score",
"check": "Output contains a numeric score between 0-100",
"weight": 1.0
}
],
"should_trigger": true
}
]
}
If no eval set exists, generate one:
python scripts/generate_eval_set.py <skill-path> to produce a starter setCreate the eval workspace outside the skill directory to avoid confusing eval artifacts with skill files. Use a sibling directory or a dedicated location:
eval-workspace/
iteration-1/
eval-0/
eval_metadata.json # Assertions and config for this eval
with_skill/
outputs/ # Skill execution outputs
timing.json # Token count + duration
grading.json # Assertion results + evidence
baseline/
outputs/
timing.json
grading.json
eval-1/
eval_metadata.json
with_skill/
outputs/
timing.json
grading.json
baseline/
outputs/
timing.json
grading.json
benchmark.json # Aggregated metrics
benchmark.md # Human-readable report
For each eval directory, create eval_metadata.json from the eval set entry:
{
"eval_id": 0,
"eval_name": "descriptive-name",
"prompt": "The user's task prompt",
"assertions": [...],
"should_trigger": true
}
For each eval in the set, spawn two parallel runs:
With-skill run (delegate to agents/skill-forge-executor.md):
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>
Baseline run (delegate to agents/skill-forge-executor.md):
Save timing data to timing.json in each run directory:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
Delegate to agents/skill-forge-grader.md for each completed run:
grading.json per run:{
"eval_id": 0,
"run_type": "with_skill",
"assertions": [
{
"name": "output-has-score",
"passed": true,
"evidence": "Found score: 87/100 on line 14"
}
],
"pass_rate": 1.0
}
Run python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>
This produces benchmark.json and benchmark.md with:
Delegate to agents/skill-forge-analyzer.md to:
Generate a summary report:
# Eval Report: [skill-name] — Iteration [N]
## Overall
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |
## Per-Eval Results
| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |
## Patterns & Insights
[From analyzer agent]
## Recommendations
[Specific improvements based on failures]
Save user feedback to feedback.json:
{
"reviews": [
{
"run_id": "eval-0-with_skill",
"feedback": "the chart is missing axis labels",
"timestamp": "2026-03-06T12:00:00Z"
}
],
"status": "complete"
}
Pass feedback to /skill-forge evolve for the next iteration.
For rigorous A/B testing between skill versions:
agents/skill-forge-comparator.mdeval-<ID>/with_skill/outputs/ and eval-<ID>/baseline/outputs/eval_metadata.json"timed_out": true in timing.jsonerror.txt in the run directory and continue with remaining evals"passed": null with evidence explaining whyBefore marking an eval run as complete:
npx claudepluginhub agricidaniel/skill-forgeTests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Executes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.
Evaluates a skill's effectiveness by running behavioral test cases and grading results against assertions. Use to validate improvements, benchmark against baselines, or create eval cases.