Skill

skill-forge-benchmark

Benchmarks Claude Code skill performance via multiple trials per eval, tracking pass rate, execution time, token usage, and variance. Aggregates to benchmark.json and generates version comparison reports. Use for 'benchmark skill' or performance tracking queries.
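The aggregation described above (several trials per eval, summarized as pass rate, execution time, token usage, and variance) might be sketched roughly as follows. This is an illustrative sketch only, not the skill's actual implementation: the `Trial` dataclass, its field names, the `aggregate` helper, and the shape of the resulting JSON are all assumptions, since the listing does not show the real schema of benchmark.json.

```python
import json
import statistics
from dataclasses import dataclass


@dataclass
class Trial:
    """One benchmark trial for a single eval (hypothetical schema)."""
    eval_id: str
    passed: bool
    seconds: float
    tokens: int


def aggregate(trials: list[Trial]) -> dict:
    """Group trials by eval and summarize each group (assumed output shape)."""
    by_eval: dict[str, list[Trial]] = {}
    for t in trials:
        by_eval.setdefault(t.eval_id, []).append(t)

    summary = {}
    for eval_id, group in by_eval.items():
        times = [t.seconds for t in group]
        tokens = [t.tokens for t in group]
        summary[eval_id] = {
            "trials": len(group),
            "pass_rate": sum(t.passed for t in group) / len(group),
            "mean_seconds": statistics.fmean(times),
            # Sample variance needs at least two trials; report 0.0 otherwise.
            "variance_seconds": statistics.variance(times) if len(times) > 1 else 0.0,
            "mean_tokens": statistics.fmean(tokens),
        }
    return summary


# Made-up trial data, purely for illustration.
trials = [
    Trial("eval-1", True, 2.0, 1000),
    Trial("eval-1", False, 4.0, 1200),
    Trial("eval-1", True, 3.0, 1100),
]
report = aggregate(trials)
print(json.dumps(report, indent=2))
```

Reporting variance alongside the mean is what lets a version comparison distinguish a real regression from run-to-run noise; a single trial per eval cannot do that.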

Python

testing

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/skill-forge:skill-forge-benchmark

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Measure and compare skill performance across iterations with statistical

SKILL.md

172 lines · ~1.4k tokens

Stats

Language: Python

Stars: 58

Forks: 28

Maintenance: Excellent

Last Commit: Apr 10, 2026

Skill Benchmarking & Performance Tracking

Process

Step 1: Define Benchmark Configuration

Step 2: Execute Benchmark Runs

Step 3: Aggregate Results

Step 4: Compare with Previous Iterations

Step 5: Generate Benchmark Report

Step 6: Threshold Gating

Error Handling

Integration with Other Sub-Skills

Similar Skills
