From evaluation
Provides rubrics and guidelines for evaluating AI outputs on accuracy, relevance, completeness, helpfulness, clarity, tone appropriateness, and safety. Includes weighting, calibration, and design artifacts.
Slash command
/evaluation:output-quality-rubrics
Without a rubric, quality evaluation is subjective and inconsistent. A rubric defines what "good" means in concrete, measurable terms — so different evaluators reach the same conclusions.
For each dimension, define a scale:
Example for Accuracy (1-5):
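A per-dimension scale could be sketched in Python roughly as follows. The level descriptions are illustrative placeholders, not the skill's own wording, and the `Dimension` class and `ACCURACY` constant are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One rubric dimension with a described anchor for every score."""
    name: str
    anchors: dict  # maps an integer score to a description of that level

    def validate(self, score: int) -> int:
        # Reject scores that have no written anchor, so every number
        # an evaluator assigns maps to a concrete description.
        if score not in self.anchors:
            raise ValueError(f"{self.name}: {score} is not a defined level")
        return score


# Hypothetical anchors for a 1-5 accuracy scale (assumed wording).
ACCURACY = Dimension(
    name="accuracy",
    anchors={
        1: "Mostly incorrect or fabricated",
        2: "Several significant factual errors",
        3: "Broadly correct with some minor errors",
        4: "Correct with only trivial imprecision",
        5: "Fully correct and verifiable",
    },
)
```

Writing an anchor for every level, rather than only the endpoints, is what lets two evaluators map the same output to the same number.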
Not all dimensions matter equally for every use case:
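One way to apply use-case-specific weights is a normalized weighted average, sketched below. The weights and scores are invented for illustration (a hypothetical use case where accuracy and safety dominate), and `weighted_score` is an assumed helper, not part of the skill.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension scores into one number using relative weights."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total means weights only need to be relative,
    # they do not have to sum to 1.
    return sum(scores[d] * w for d, w in weights.items()) / total


# Hypothetical weights: accuracy and safety matter most here.
weights = {"accuracy": 4, "safety": 3, "completeness": 1, "clarity": 1, "relevance": 1}
scores = {"accuracy": 4, "safety": 5, "completeness": 3, "clarity": 4, "relevance": 4}

overall = weighted_score(scores, weights)  # (16 + 15 + 3 + 4 + 4) / 10 = 4.2
```

A different use case would reuse the same function with a different `weights` dictionary, which keeps the rubric itself unchanged across contexts.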
A rubric is only useful if evaluators use it consistently:
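Consistency between evaluators can be checked numerically. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic for two raters; it is offered as one possible calibration check, not as the method the skill prescribes, and the sample ratings are made up.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater assigned scores independently
    # according to their own score distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


# Made-up accuracy scores from two evaluators on the same six outputs.
evaluator_1 = [5, 4, 4, 2, 3, 1]
evaluator_2 = [5, 4, 3, 2, 3, 1]
agreement = cohens_kappa(evaluator_1, evaluator_2)
```

A value near 1 suggests the rubric is being read the same way by both evaluators; a value near 0 suggests agreement no better than chance, which usually points to anchors that need tighter wording.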
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation