From eval-framework
Execute a structured evaluation against a set of LLM outputs and produce a scored report. Use this skill when asked to "run the eval", "score these outputs", "evaluate this response", or "generate an evaluation report".
Slash command
/eval-framework:eval-run
Apply a scoring rubric to one or more LLM outputs and produce a structured scored report.
Read or request:
- The rubric (use /eval-design to create one)
For each output and each rubric dimension:
Record results in a scoring matrix:
Output: <id or label>
Dimension: Correctness Score: 4/5 "The answer covers all main points but omits X."
Dimension: Format Score: 5/5 "JSON schema matches the spec exactly."
Dimension: Clarity Score: 3/5 "Second paragraph is ambiguous about Y."
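One way to hold the scoring matrix in code before printing it is sketched below. This is a minimal illustration, not part of the skill itself: the `DimensionScore` class, the `format_matrix` helper, and the default 5-point scale are assumptions inferred from the "4/5" examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One cell of the scoring matrix (hypothetical structure)."""
    output_id: str
    dimension: str
    score: int
    justification: str
    max_score: int = 5  # assumed scale, taken from the "4/5" examples


def format_matrix(entries: list[DimensionScore]) -> str:
    """Group cells by output and print them in the layout shown above."""
    grouped: dict[str, list[DimensionScore]] = {}
    for entry in entries:
        # dicts preserve insertion order, so outputs appear in first-seen order
        grouped.setdefault(entry.output_id, []).append(entry)

    lines: list[str] = []
    for output_id, cells in grouped.items():
        lines.append(f"Output: {output_id}")
        for cell in cells:
            lines.append(
                f"Dimension: {cell.dimension} "
                f'Score: {cell.score}/{cell.max_score} "{cell.justification}"'
            )
    return "\n".join(lines)
```

Keeping the justification next to each score means the later report can quote the evidence for a low score without re-reading the output.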
For each output:
If evaluating multiple outputs:
Output a Markdown report with:
## Evaluation Report
**Task**: <original prompt summary>
**Rubric**: <rubric name/version>
**Outputs evaluated**: <count>
### Scores
| Output | Correctness | Format | Clarity | ... | Total | Pass? |
|--------|-------------|--------|---------|-----|-------|-------|
| A | 4 | 5 | 3 | ... | 3.9 | YES |
| B | 2 | 3 | 2 | ... | 2.4 | NO |
### Key findings
- <summary of strengths across outputs>
- <summary of common weaknesses>
- <recommended next action>
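The Scores table in the template above could be generated along these lines. Treat this as a sketch under stated assumptions: the source does not say how Total is computed or where the pass cutoff sits, so the weighted mean, the `weights` mapping, and the `pass_threshold` of 3.5 are all hypothetical choices, as are the function names.

```python
def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted mean of the dimension scores (assumed aggregation rule)."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


def render_scores_table(
    results: dict[str, dict[str, int]],
    weights: dict[str, float],
    pass_threshold: float = 3.5,  # hypothetical cutoff, not from the source
) -> str:
    """Render one Markdown table row per output, in rubric dimension order."""
    dimensions = list(weights)
    header = "| Output | " + " | ".join(dimensions) + " | Total | Pass? |"
    # One divider cell per column: Output, each dimension, Total, Pass?
    divider = "|" + "---|" * (len(dimensions) + 3)
    rows = []
    for output_id, scores in results.items():
        total = weighted_total(scores, weights)
        cells = " | ".join(str(scores[dim]) for dim in dimensions)
        verdict = "YES" if total >= pass_threshold else "NO"
        rows.append(f"| {output_id} | {cells} | {total:.1f} | {verdict} |")
    return "\n".join([header, divider, *rows])
```

With only the three named dimensions and equal weights, the totals will not match the 3.9 and 2.4 shown in the template, since that table includes further dimensions behind the "..." column.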
For each failing output, list the top 2 dimensions dragging the score down and suggest concrete rewrites or prompt improvements that would raise those scores.
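Picking the two dimensions that drag a failing output down can be sketched as a ranking by weighted shortfall. The helper name `weakest_dimensions`, the optional `weights` argument, and the 5-point maximum are assumptions made for illustration; the source does not define how "dragging the score down" is measured.

```python
from typing import Optional


def weakest_dimensions(
    scores: dict[str, int],
    weights: Optional[dict[str, float]] = None,
    max_score: int = 5,  # assumed scale
    n: int = 2,
) -> list[str]:
    """Return the n dimensions with the largest weighted gap to max_score."""
    if weights is None:
        weights = {dim: 1.0 for dim in scores}  # assume equal weights

    def shortfall(dim: str) -> float:
        return weights[dim] * (max_score - scores[dim])

    # sorted() is stable, so tied dimensions keep their rubric order
    return sorted(scores, key=lambda dim: -shortfall(dim))[:n]
```

For output B in the template, `weakest_dimensions({"Correctness": 2, "Format": 3, "Clarity": 2})` picks Correctness and Clarity, which are then the dimensions to target with rewrites or prompt changes.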
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework