ClaudePluginHub

Community directory for discovering and installing Claude Code plugins.


© 2025 ClaudePluginHub

Community Maintained · Not affiliated with Anthropic


designing-llm-evals | llm-evals


Skill

designing-llm-evals

Use when building or reviewing an evaluation for an LLM feature — assembling a representative test set, choosing pass criteria (exact match, programmatic checks, rubric, or LLM-as-judge), and catching regressions. Use when asking "how do I know this prompt or model change is better?"

Popularity

Parent stars

1

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/llm-evals:designing-llm-evals

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

If you can't measure it, you're tuning by vibes. Decide how you'll know the output is good

SKILL.md

47 lines · ~535 tokens

Stats

Language: TypeScript

Parent stars: 1

Maintenance: Good

Last Commit: Jun 12, 2026


Designing LLM evals

If you can't measure it, you're tuning by vibes. Decide how you'll know the output is good before you start changing prompts or models.

Build a real test set

Draw cases from real inputs, including edge cases and known past failures. A handful of representative cases beats a thousand synthetic near-duplicates.
Keep a holdout you don't tune against, so you can detect overfitting to the eval.
Label the expected behavior (exact answer, or what "good" means) per case.
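
The points above can be sketched as a small data structure plus a stable holdout split. This is a minimal sketch, not part of the skill itself: the `EvalCase` fields, the `is_holdout` and `split_holdout` helpers, and the 20% holdout fraction are all illustrative assumptions. Hashing the case id means a case never moves between the tuning set and the holdout when new cases are added.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One labeled test case. Field names are illustrative assumptions."""
    case_id: str
    category: str    # e.g. "typical", "edge_case", "past_failure"
    input_text: str  # a real input, ideally taken from production traffic
    expected: str    # the exact answer, or a description of what "good" means


def is_holdout(case_id: str, holdout_fraction: float = 0.2) -> bool:
    """Assign a case to the holdout by hashing its id.

    The result depends only on the id, so it is stable across runs and
    does not change when other cases are added or removed.
    """
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < holdout_fraction


def split_holdout(cases, holdout_fraction: float = 0.2):
    """Return (tuning_set, holdout_set); tune prompts only against the first."""
    tuning = [c for c in cases if not is_holdout(c.case_id, holdout_fraction)]
    holdout = [c for c in cases if is_holdout(c.case_id, holdout_fraction)]
    return tuning, holdout
```

Scores on the holdout are only trustworthy if nobody inspects its failures while iterating on the prompt; a large gap between tuning-set and holdout pass rates suggests overfitting to the eval.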

Choose the cheapest valid criterion

In order of preference, cheapest and most objective first. Use the first one that validly measures the task:

Exact / structural match — classification, extraction, JSON shape. Cheap and objective.
Programmatic checks — does it compile, parse, satisfy invariants, stay under length?
Rubric — score against explicit criteria for open-ended output.
LLM-as-judge — last resort for open-ended quality. Calibrate it against human labels and watch for judge bias (favoring length, position, its own style).
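
The first three criteria might look like the following. This is a hedged sketch under stated assumptions: the function names, the case-insensitive normalization, and the boolean rubric marks are illustrative choices rather than anything the skill prescribes, and who assigns the rubric marks (a human reviewer or a calibrated judge) is left outside the sketch.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and lowercasing.

    Suits classification labels; the normalization is an assumption and
    should be tightened or removed if case matters for the task.
    """
    return output.strip().lower() == expected.strip().lower()


def json_shape_match(output: str, required_keys: set) -> bool:
    """Structural match: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def within_length(output: str, max_chars: int) -> bool:
    """Programmatic check: output stays under a length budget."""
    return len(output) <= max_chars


def rubric_score(marks: dict) -> float:
    """Fraction of explicit rubric criteria that were met.

    `marks` maps each criterion name to True/False as assigned by a reviewer.
    """
    if not marks:
        raise ValueError("rubric needs at least one criterion")
    return sum(bool(met) for met in marks.values()) / len(marks)
```

For example, `rubric_score({"cites_source": True, "answers_question": True, "no_speculation": False, "polite": True})` returns `0.75`, and the per-criterion marks show which criterion failed.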

Score honestly

Report the pass rate and the specific failures, broken down by category — not a single aggregate number that hides which class of input is broken.
Re-run the eval on every prompt or model change to catch regressions. Pin the test set so scores stay comparable from run to run.
Offline eval is necessary but not sufficient: also sample and review production outputs.

Anti-patterns to flag

Vibes-only evaluation; tiny non-representative sets.
Judging on the same examples you tuned on (overfit).
One aggregate score masking a category that regressed.
An uncalibrated LLM-judge treated as ground truth.
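
To avoid the last anti-pattern, a judge can be checked against a set of human-labeled cases before its verdicts are trusted. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance); the function names and the list-of-labels input format are assumptions for illustration, and what counts as "good enough" agreement is a judgment call the skill does not specify.

```python
def agreement_rate(human, judge):
    """Fraction of cases where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = agreement_rate(human, judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (list(human).count(label) / n) * (list(judge).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

Raw agreement can look high when one label dominates, which is why the chance-corrected figure is worth reporting alongside it. Checking for the biases named earlier (length, position, own style) needs separate probes, such as swapping the order of two candidate answers and confirming the verdict does not flip.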

Out of scope: which model to use, pricing, and batch/eval APIs are provider-specific — this skill is about eval design, not the runner.


$ npx claudepluginhub meaganewaller/rosetta --plugin llm-evals