From llm-evals
Use when building or reviewing an evaluation for an LLM feature — assembling a representative test set, choosing pass criteria (exact match, programmatic checks, rubric, or LLM-as-judge), and catching regressions. Use when asking "how do I know this prompt or model change is better?"
Slash command
/llm-evals:designing-llm-evals
If you can't measure it, you're tuning by vibes. Decide how you'll know the output is good before you start changing prompts or models.
Pass criteria, in order of preference (use the strongest that fits the task): exact match, programmatic checks, rubric, LLM-as-judge.
Out of scope: which model to use, pricing, and batch/eval APIs are provider-specific — this skill is about eval design, not the runner.
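As a rough illustration of the two strongest pass criteria (exact match and a programmatic check) and of catching a regression, here is a minimal sketch. Everything in it is hypothetical and not part of the skill itself: the names `exact_match`, `valid_json_with_keys`, and `pass_rate`, the two test cases, and the stubbed `baseline` and `candidate` functions, which stand in for real model calls.

```python
import json
from typing import Callable, List, Tuple

Grader = Callable[[str], bool]          # takes model output, returns pass/fail
Case = Tuple[str, Grader]               # (prompt, grader) pair


def exact_match(expected: str) -> Grader:
    """Pass only if the output equals the expected string, ignoring outer whitespace."""
    return lambda output: output.strip() == expected.strip()


def valid_json_with_keys(*keys: str) -> Grader:
    """Programmatic check: output must parse as a JSON object containing every key."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check


def pass_rate(generate: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose output satisfies its grader."""
    passed = sum(1 for prompt, grader in cases if grader(generate(prompt)))
    return passed / len(cases)


# Hypothetical two-case test set; a real one should be representative of production inputs.
CASES: List[Case] = [
    ("What is 2 + 2? Answer with the number only.", exact_match("4")),
    ("Return JSON with keys name and age for: Ada, 36.", valid_json_with_keys("name", "age")),
]


# Stubs standing in for the current prompt/model and a proposed change (assumptions).
def baseline(prompt: str) -> str:
    return " 4\n" if "2 + 2" in prompt else '{"name": "Ada", "age": 36}'


def candidate(prompt: str) -> str:
    return "The answer is 4." if "2 + 2" in prompt else '{"name": "Ada", "age": 36}'


if __name__ == "__main__":
    before = pass_rate(baseline, CASES)
    after = pass_rate(candidate, CASES)
    print(f"baseline={before:.2f} candidate={after:.2f}")
    if after < before:
        print("Regression: the candidate passes fewer cases than the baseline.")
```

Running the same fixed test set against both versions is what turns "this feels better" into a comparison; rubric and LLM-as-judge criteria would slot in as additional `Grader` functions when no stricter check fits.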
npx claudepluginhub meaganewaller/rosetta --plugin llm-evals