From skill-forge
Benchmarks Claude Code skill performance via multiple trials per eval, tracking pass rate, execution time, token usage, and variance. Aggregates to benchmark.json and generates version comparison reports. Use for 'benchmark skill' or performance tracking queries.
How this skill is triggered — by the user, by Claude, or both
Slash command
/skill-forge:skill-forge-benchmarkThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Measure and compare skill performance across iterations with statistical
Measure and compare skill performance across iterations with statistical rigor using multiple trials, variance analysis, and trend tracking.
Accept configuration as:
evals/evals.json (from /skill-forge eval)Benchmark config schema:
{
"skill_name": "my-skill",
"skill_path": "./my-skill",
"eval_set_path": "./evals/evals.json",
"trials_per_eval": 3,
"baseline_type": "no_skill",
"previous_benchmark": null,
"thresholds": {
"min_pass_rate": 0.8,
"max_avg_tokens": 100000,
"max_avg_duration_seconds": 120,
"min_improvement_ratio": 1.0
}
}
For each eval, run trials_per_eval times (default: 3) to get reliable metrics:
timing.json and grading.jsonUse agents/skill-forge-executor.md for parallel execution where possible.
Run python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>:
Output benchmark.json schema:
{
"skill_name": "my-skill",
"iteration": 1,
"timestamp": "2026-03-06T12:00:00Z",
"summary": {
"total_evals": 10,
"with_skill": {
"pass_rate": 0.87,
"pass_rate_std": 0.05,
"avg_tokens": 45000,
"avg_duration_seconds": 34.2
},
"baseline": {
"pass_rate": 0.60,
"pass_rate_std": 0.08,
"avg_tokens": 62000,
"avg_duration_seconds": 52.1
},
"improvement_ratio": 1.45,
"token_savings_ratio": 0.73,
"time_savings_ratio": 0.66
},
"per_eval": [
{
"eval_id": 0,
"eval_name": "basic-trigger",
"with_skill": {"pass_rate": 1.0, "avg_tokens": 30000, "avg_duration_seconds": 20.1},
"baseline": {"pass_rate": 0.67, "avg_tokens": 50000, "avg_duration_seconds": 45.0},
"trials": 3
}
],
"thresholds_met": {
"min_pass_rate": true,
"max_avg_tokens": true,
"max_avg_duration_seconds": true,
"min_improvement_ratio": true
}
}
If previous_benchmark is provided or prior iteration-<N-1> exists:
benchmark.json# Benchmark Report: [skill-name]
## Iteration [N] vs [N-1]
### Summary
| Metric | Current | Previous | Delta | Threshold | Status |
|--------|---------|----------|-------|-----------|--------|
| Pass Rate | 87% | 78% | +9% | >= 80% | PASS |
| Avg Tokens | 45K | 52K | -13% | <= 100K | PASS |
| Avg Time | 34s | 41s | -17% | <= 120s | PASS |
| Improvement | 1.45x | 1.30x | +0.15x | >= 1.0x | PASS |
### Regressions (Action Required)
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-5 | PASS | FAIL | Output missing required section |
### Improvements
| Eval | Previous | Current | Notes |
|------|----------|---------|-------|
| eval-3 | FAIL | PASS | Error handling now works |
### Per-Eval Detail
[Full breakdown table]
### Variance Analysis
| Eval | Pass Rate | Std Dev | Trials | Reliability |
|------|-----------|---------|--------|-------------|
| eval-0 | 100% | 0.00 | 3 | High |
| eval-1 | 67% | 0.47 | 3 | Low (investigate) |
### Recommendations
[Based on regressions, low-reliability evals, and threshold failures]
If any threshold fails:
/skill-forge evolve to address issues"trials_completed" vs "trials_requested" in per-eval results"unreliable" in the reportnpx claudepluginhub agricidaniel/skill-forgeTests and benchmarks Claude Code skills empirically via evaluation-driven development. Compares skill vs baseline performance using pass rates, timing, token metrics in quick workflow or 7-phase full pipeline.
Runs evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns sub-agents for parallel execution and generates JSON reports.
Evaluates skill output quality via assertion-based grading, blind before/after comparison, and variance analysis across 3 runs per scenario. Use for benchmarking, comparing skill versions, or triggered by /skill-eval.