Skill

skill-benchmarking

Measures latency, token cost, and accuracy across LLM skill/prompt variants. Runs paired evaluations, audits token-budget compliance, and flags insufficient sample sizes.

ai-ml

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/prompt-engineer:skill-benchmarking

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You have deep expertise in benchmarking LLM skills and prompts. When the user is comparing variants, measuring runtime cost, or auditing skill quality across a library, apply this knowledge automatically.

SKILL.md

45 lines · ~674 tokens

Stats

Parent stars: 13

Maintenance: Excellent

Last commit: May 29, 2026

Core competencies

Latency measurement:

Measure p50, p95, p99 latency — averages hide tail risk that ruins UX
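Separate first-token latency (time to first token, TTFT) from total completion time
Account for tool-use loops: each sequential tool call adds a model round trip plus the tool's own execution time, so a skill that calls 5 tools in sequence can take roughly 5× as long as a single-turn call, or more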
Hold model, temperature, and max_tokens constant across variants when benchmarking
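
As a minimal sketch of the percentile reporting described above, the snippet below computes p50, p95, and p99 next to the mean from a list of per-call latencies. The function name `latency_percentiles` and the choice of the "inclusive" interpolation method are assumptions for illustration, not part of the skill.

```python
import statistics


def latency_percentiles(samples_s):
    """Return mean, p50, p95, and p99 (seconds) for a list of call latencies."""
    if len(samples_s) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_s),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Hypothetical data: 99 fast calls and one slow outlier.
    samples = [1.0] * 99 + [10.0]
    print(latency_percentiles(samples))
```

Reporting the mean beside the percentiles makes it easy to see when a small number of slow calls is pulling the average away from the typical case.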

Cost and token accounting:

Track input tokens, output tokens, and cached tokens separately: each is billed at a different rate, and the rates vary by model
Reference current model pricing (Anthropic, OpenAI, Google) when computing cost-per-call
Token-budget compliance: every skill loaded into context consumes part of the budget. Audit the cumulative skill load against the target context window
Watch for prompt-cache eligibility: caching typically matches on the prompt prefix, so static instructions placed before dynamic content can be cached, while instructions placed after it cannot
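
The accounting above can be sketched as follows. Everything here is illustrative: the `Pricing` and `Usage` types, the `max_share` threshold, and the numbers in the usage example are placeholders, not real vendor prices or a recommended budget. Look up current rates before using the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # USD per million tokens. Placeholder fields, fill in from current price lists.
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


@dataclass(frozen=True)
class Usage:
    input_tokens: int   # uncached input tokens
    output_tokens: int
    cached_tokens: int  # input tokens served from the prompt cache


def cost_per_call(usage, pricing):
    """Cost in USD for one call, pricing each token class separately."""
    total = (
        usage.input_tokens * pricing.input_per_mtok
        + usage.output_tokens * pricing.output_per_mtok
        + usage.cached_tokens * pricing.cached_input_per_mtok
    )
    return total / 1_000_000


def budget_report(skill_tokens, window_tokens, max_share=0.25):
    """Compare cumulative skill load with an assumed share of the context window."""
    total = sum(skill_tokens.values())
    limit = window_tokens * max_share
    return {"total": total, "limit": limit, "over_budget": total > limit}


if __name__ == "__main__":
    hypothetical = Pricing(input_per_mtok=2.0, output_per_mtok=10.0,
                           cached_input_per_mtok=0.5)
    print(cost_per_call(Usage(1200, 300, 4000), hypothetical))
    print(budget_report({"skill-benchmarking": 674}, window_tokens=200_000))
```

Keeping the three token classes as separate fields means a change in cache hit rate shows up directly in the cost figure instead of being averaged away.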

Accuracy and quality benchmarking:

Use paired evaluation (same cases for both variants) to control variance
Apply paired bootstrap resampling for non-normal score distributions
Report effect size alongside p-value — statistical significance ≠ practical significance
Subgroup analysis: an aggregate win can mask regression on an important segment
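
A minimal sketch of a paired bootstrap over per-case score differences, assuming both variants were scored on the same cases in the same order. The function name, the percentile confidence interval, and the standardized effect size (mean difference divided by the standard deviation of the differences) are illustrative choices. `share_le_zero` is a rough one-sided indicator, not a formal p-value.

```python
import random
import statistics


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Bootstrap the mean of per-case differences (B minus A)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired evaluation needs the same cases for both variants")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired cases")

    rng = random.Random(seed)
    # Resample whole pairs with replacement by resampling their differences.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]

    sd = statistics.stdev(diffs)
    return {
        "mean_diff": statistics.fmean(diffs),
        "ci95": (lo, hi),
        "effect_size": statistics.fmean(diffs) / sd if sd > 0 else float("nan"),
        "share_le_zero": sum(m <= 0 for m in means) / n_resamples,
    }
```

For subgroup analysis, the same function can be called once per segment on the subset of paired cases that belong to it, so that an aggregate win can be checked against each segment's own interval.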

Skill-library hygiene:

Description quality drives correct activation — too narrow, the skill never fires; too broad, it activates incorrectly
Length budget per skill (target 1500–2500 tokens unless justified) keeps context window healthy
Static analysis catches drift: missing frontmatter, dead instructions, duplicate guidance across skills
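
The static checks above might look like the sketch below. It assumes each skill is a text file that opens with a `---` delimited frontmatter block containing `name` and `description` keys, and it estimates tokens at about four characters each. Both are assumptions for illustration, and a real tokenizer should replace the estimate. The 2500-token ceiling is the upper end of the target range given above.

```python
import re

REQUIRED_KEYS = ("name", "description")  # assumed required frontmatter keys
MAX_TOKENS = 2500  # upper end of the 1500-2500 token target


def estimate_tokens(text):
    # Crude heuristic (about 4 characters per token), not a real tokenizer.
    return len(text) // 4


def lint_skill(text):
    """Return a list of issue strings for one skill file's contents."""
    issues = []
    match = re.match(r"\A---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        issues.append("missing frontmatter")
    else:
        keys = {
            line.split(":", 1)[0].strip()
            for line in match.group(1).splitlines()
            if ":" in line
        }
        for key in REQUIRED_KEYS:
            if key not in keys:
                issues.append(f"frontmatter missing '{key}'")
    if estimate_tokens(text) > MAX_TOKENS:
        issues.append("over length budget")
    return issues


def duplicate_lines(skills, min_len=40):
    """Map each long line to the skills that repeat it (possible duplicate guidance)."""
    seen = {}
    for name, text in skills.items():
        for line in {raw.strip() for raw in text.splitlines()}:
            if len(line) >= min_len:
                seen.setdefault(line, []).append(name)
    return {line: sorted(names) for line, names in seen.items() if len(names) > 1}
```

Exact-match line comparison only catches verbatim duplication; guidance that is reworded across skills would need a fuzzier comparison.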

Communication style

When assisting with benchmarking tasks:

Cite the metric and the methodology together — "p95 latency 2.4s on 200 paired runs at temp=0" is actionable; "it's slow" isn't
Flag when sample size is insufficient for the claimed conclusion
Always note that benchmark outputs are drafts requiring engineer verification before production decisions
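
One way to decide when to flag sample size is a rough power calculation for paired differences, sketched below. It uses the standard normal approximation for a two-sided test; the function names and the default `alpha` and `power` values are assumptions, and the standard deviation of the per-case differences has to come from a pilot run.

```python
import math
from statistics import NormalDist


def required_pairs(sd_diff, min_effect, alpha=0.05, power=0.8):
    """Approximate number of paired cases needed to detect a mean difference
    of min_effect, given the standard deviation of per-case differences."""
    if sd_diff <= 0 or min_effect <= 0:
        raise ValueError("sd_diff and min_effect must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd_diff / min_effect) ** 2)


def sample_size_flag(n_pairs, sd_diff, min_effect):
    """Return a warning string when n_pairs is below the approximate requirement."""
    needed = required_pairs(sd_diff, min_effect)
    if n_pairs < needed:
        return f"insufficient sample: {n_pairs} pairs, about {needed} needed"
    return None
```

The approximation tends to understate the requirement for very small samples, so a result near the threshold is better treated as insufficient.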

Disclaimer

Benchmark numbers and statistical verdicts produced through this plugin reflect the eval set, model version, and methodology used. Production behavior can differ — the prompt engineer is responsible for confirming benchmarks generalize before relying on them for shipping decisions.

More prompt-engineering AI tools and resources at https://theaicareerlab.com/professions/prompt-engineer
