From prompt-engineer
Measures latency, token cost, and accuracy across LLM skill/prompt variants. Runs paired evaluations, audits token-budget compliance, and flags insufficient sample sizes.
Slash command
/prompt-engineer:skill-benchmarking
You have deep expertise in benchmarking LLM skills and prompts. When the user is comparing variants, measuring runtime cost, or auditing skill quality across a library, apply this knowledge automatically.
Latency measurement:
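As a rough illustration of latency measurement for a single variant, the Python sketch below times repeated calls and reports the median, a nearest-rank p95, and the maximum. It is not the plugin's own harness: `call_variant`, `measure_latency`, `summarize`, and the default run and warmup counts are hypothetical names and values chosen for this example.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return run count, median, nearest-rank p95, and max for durations in seconds."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    return {
        "runs": len(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[rank - 1],
        "max_s": ordered[-1],
    }


def measure_latency(
    call_variant: Callable[[], object],  # hypothetical: wraps one request to the variant
    runs: int = 20,                      # assumed default, not taken from the plugin
    warmup: int = 2,                     # warmup calls are timed but discarded
) -> Dict[str, float]:
    """Time `runs` calls of `call_variant` with a monotonic clock."""
    for _ in range(warmup):
        call_variant()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call_variant()
        samples.append(time.perf_counter() - start)
    return summarize(samples)
```

Reporting the median alongside a tail percentile, rather than only a mean, keeps a few slow calls from hiding in the average.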
Cost and token accounting:
Accuracy and quality benchmarking:
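One way to picture a paired evaluation with a sample-size flag is the sketch below, which scores two variants on the same eval items and applies an exact two-sided sign test to the items where they disagree. This is an assumed approach for illustration, not necessarily the method the plugin uses; `paired_sign_test` and the `MIN_DISCORDANT` threshold are hypothetical.

```python
from math import comb
from typing import Dict, List

MIN_DISCORDANT = 10  # hypothetical threshold for flagging an underpowered comparison


def paired_sign_test(a_correct: List[bool], b_correct: List[bool]) -> Dict[str, object]:
    """Compare two variants scored on the same items, in the same order."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both variants must be scored on the same items")
    # Only items where exactly one variant is correct carry information.
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        p_value = 1.0
    else:
        k = min(a_only, b_only)
        # Exact binomial tail under the null that either variant is equally likely to win.
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)
    return {
        "a_only": a_only,
        "b_only": b_only,
        "p_value": p_value,
        "insufficient_sample": n < MIN_DISCORDANT,
    }
```

Pairing on the same items removes item-difficulty variance from the comparison, and the flag makes it explicit when too few disagreements exist to support a verdict either way.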
Skill-library hygiene:
When assisting with benchmarking tasks:
Benchmark numbers and statistical verdicts produced through this plugin reflect the eval set, model version, and methodology used. Production behavior can differ, so the prompt engineer is responsible for confirming that benchmarks generalize before relying on them for shipping decisions.
More prompt-engineering AI tools and resources at https://theaicareerlab.com/professions/prompt-engineer
npx claudepluginhub alexclowe/awesome-claude-cowork-plugins --plugin prompt-engineer