From eval-framework
Design evaluation criteria and a 1-5 scoring rubric for a task or LLM output. Use this skill when asked to "create an eval", "define evaluation criteria", "build a scoring rubric", or "design how to measure quality" for any output.
Slash command
/eval-framework:eval-design
Design a structured evaluation framework with explicit scoring criteria for the given task.
Read the task description and identify:
Create 3–6 independent dimensions that together cover output quality. For each dimension:
Dimension: <name>
Weight: <percentage of total score; weights across all dimensions must sum to 100>
Question: <one question an evaluator answers to score this dimension>
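As a rough illustration of how the weights might be used, the sketch below checks that a set of dimensions respects the 3–6 count and the sum-to-100 rule, then combines per-dimension 1-5 scores into a single weighted score. The `Dimension` class, the function names, and the three example dimensions are hypothetical and chosen only for this example; the skill itself does not prescribe an implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str      # hypothetical example names, not taken from the skill
    weight: int    # percentage of the total score
    question: str  # the one question an evaluator answers


def validate_weights(dimensions: list[Dimension]) -> None:
    """Raise ValueError unless there are 3-6 dimensions whose weights sum to 100."""
    if not 3 <= len(dimensions) <= 6:
        raise ValueError(f"expected 3-6 dimensions, got {len(dimensions)}")
    total = sum(d.weight for d in dimensions)
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")


def weighted_score(dimensions: list[Dimension], scores: dict[str, int]) -> float:
    """Combine per-dimension 1-5 scores into one score on the same 1-5 scale."""
    validate_weights(dimensions)
    for d in dimensions:
        if not 1 <= scores[d.name] <= 5:
            raise ValueError(f"score for {d.name} must be between 1 and 5")
    return sum(d.weight * scores[d.name] for d in dimensions) / 100


# Illustrative dimensions only (assumed for this sketch).
EXAMPLE_DIMENSIONS = [
    Dimension("Accuracy", 50, "Are the factual claims correct?"),
    Dimension("Completeness", 30, "Does the output cover every part of the task?"),
    Dimension("Clarity", 20, "Can the intended reader follow it on first read?"),
]

example_total = weighted_score(
    EXAMPLE_DIMENSIONS, {"Accuracy": 4, "Completeness": 3, "Clarity": 5}
)
```

Dividing by 100 keeps the combined score on the 1-5 scale, so it can be read against the same level descriptions as the individual dimensions.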
Common dimensions for LLM evaluation:
For each dimension, define what each score level looks like:
Score 5 (Excellent): <concrete description of a 5/5 response>
Score 4 (Good): <concrete description of a 4/5 response>
Score 3 (Acceptable): <concrete description of a 3/5 response>
Score 2 (Poor): <concrete description of a 2/5 response>
Score 1 (Failing): <concrete description of a 1/5 response>
Anchored rubrics reduce inter-rater variance. Each level must be distinguishable: an evaluator comparing two outputs of different quality should be able to assign them different scores consistently.
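One way to check whether the anchors are doing their job is to have several evaluators score the same outputs on one dimension and look at how much they disagree. The sketch below computes the per-output population variance of the scores. The function names and the `max_variance` cutoff of 0.5 are assumptions made for illustration, not values from this skill.

```python
from statistics import pvariance


def rater_variance(scores_by_rater: dict[str, list[int]]) -> list[float]:
    """Return, for each output, the population variance of the raters' scores.

    Every rater must score the same outputs in the same order.
    """
    lengths = {len(scores) for scores in scores_by_rater.values()}
    if len(lengths) != 1:
        raise ValueError("every rater must score the same number of outputs")
    # zip(*...) regroups the scores so each tuple holds one output's scores.
    return [pvariance(column) for column in zip(*scores_by_rater.values())]


def outputs_needing_review(
    scores_by_rater: dict[str, list[int]],
    max_variance: float = 0.5,  # hypothetical cutoff, tune for your own rubric
) -> list[int]:
    """Indices of outputs where raters disagree more than the cutoff allows."""
    variances = rater_variance(scores_by_rater)
    return [i for i, v in enumerate(variances) if v > max_variance]


# Two hypothetical raters scoring the same three outputs on one dimension.
example_scores = {
    "rater_a": [4, 2, 5],
    "rater_b": [4, 4, 5],
}
disputed = outputs_needing_review(example_scores)
```

Outputs that land in the disputed list are a hint that the level descriptions for that dimension may be too vague to separate adjacent scores.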
Specify:
Run a quick sanity check:
Present the final rubric as a Markdown table:
| Dimension | Weight | Score 1 | Score 3 | Score 5 |
|---|---|---|---|---|
| ... | ...% | ... | ... | ... |
Include pass/fail thresholds below the table.
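A minimal sketch of producing that table and a pass/fail verdict programmatically is shown below. The dictionary keys, the function names, and the default `pass_threshold` of 3.5 are hypothetical; the skill leaves the actual thresholds for the rubric author to specify.

```python
HEADER = "| Dimension | Weight | Score 1 | Score 3 | Score 5 |"
DIVIDER = "|---|---|---|---|---|"


def escape_cell(text: str) -> str:
    """Escape pipe characters so cell text cannot break the table layout."""
    return text.replace("|", "\\|")


def render_rubric_table(rows: list[dict[str, str]]) -> str:
    """Render rubric rows as a Markdown table with the columns used above."""
    lines = [HEADER, DIVIDER]
    for row in rows:
        cells = [
            escape_cell(row["dimension"]),
            f"{row['weight']}%",
            escape_cell(row["score_1"]),
            escape_cell(row["score_3"]),
            escape_cell(row["score_5"]),
        ]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def verdict(weighted_score: float, pass_threshold: float = 3.5) -> str:
    """Return "pass" when the weighted score meets the assumed threshold."""
    return "pass" if weighted_score >= pass_threshold else "fail"


# Illustrative row only; the descriptions are placeholders for this sketch.
example_table = render_rubric_table([
    {
        "dimension": "Accuracy",
        "weight": "50",
        "score_1": "Most claims are wrong",
        "score_3": "Minor factual slips",
        "score_5": "No factual errors",
    },
])
```

Keeping the threshold as an explicit parameter makes it easy to state the chosen value below the table alongside the verdict it produces.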
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework