From eval-framework
Compare two LLM outputs on the same evaluation criteria and recommend a winner with justification. Use this skill when asked to "compare these outputs", "which response is better", "A/B eval", or "pick the best candidate".
Slash command
/eval-framework:eval-compare
Compare two candidate outputs on shared evaluation criteria and produce a justified recommendation.
Read or request:
/eval-design to create one, or define ad-hoc dimensions)
If no rubric exists, generate a minimal one on the spot:
Score Output A across all dimensions first, then score Output B. Do NOT read Output B while scoring Output A — this prevents anchoring bias.
Record scores in a comparison table:
| Dimension | Weight | Score A | Score B | Notes |
|---|---|---|---|---|
| ... | ...% | ... | ... | ... |
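The weighted totals implied by the comparison table can be computed along the lines of the minimal sketch below. The `DimensionScore` class, the `weighted_totals` helper, the 40/20/20/20 weights, and the 1-5 score scale are illustrative assumptions and not part of the skill itself; only the dimension names and point gaps mirror the trade-off example in this document.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """One row of the comparison table (hypothetical structure)."""
    name: str
    weight: float   # percentage weight; rows are assumed to sum to 100
    score_a: float  # score for Output A (a 1-5 scale is assumed)
    score_b: float  # score for Output B


def weighted_totals(rows):
    """Return (total_a, total_b) as weight-normalised averages."""
    total_weight = sum(r.weight for r in rows)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    total_a = sum(r.weight * r.score_a for r in rows) / total_weight
    total_b = sum(r.weight * r.score_b for r in rows) / total_weight
    return round(total_a, 2), round(total_b, 2)


# Illustrative scores only; the weights and the scale are assumptions.
rows = [
    DimensionScore("Correctness", 40, 4.5, 3.0),
    DimensionScore("Completeness", 20, 4.0, 3.5),
    DimensionScore("Clarity", 20, 3.0, 4.0),
    DimensionScore("Format compliance", 20, 3.0, 4.0),
]

print(weighted_totals(rows))  # (3.8, 3.5)
```

Dividing by the weight sum keeps the totals on the same scale as the per-dimension scores even if the weights do not add up to exactly 100.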
For each dimension, mark which output wins (A, B, or Tie):
Report trade-offs explicitly when one output wins on some dimensions but loses on others:
Output A is stronger on: Correctness (+1.5 pts), Completeness (+0.5 pts)
Output B is stronger on: Clarity (+1.0 pts), Format compliance (+1.0 pts)
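The per-dimension winners and the trade-off lines above can be derived mechanically from the score pairs. The following is a sketch under stated assumptions: `dimension_winner`, `trade_off_report`, the `tie_margin` parameter, and the sample scores are hypothetical names and values chosen to reproduce the example lines, not something the skill prescribes.

```python
def dimension_winner(score_a, score_b, tie_margin=0.0):
    """Return "A", "B", or "Tie" for one dimension.

    tie_margin is an assumed knob: gaps at or below it count as a tie.
    """
    diff = score_a - score_b
    if abs(diff) <= tie_margin:
        return "Tie"
    return "A" if diff > 0 else "B"


def trade_off_report(scores):
    """Build the 'stronger on' lines from {dimension: (score_a, score_b)}."""
    a_wins, b_wins = [], []
    for name, (a, b) in scores.items():
        winner = dimension_winner(a, b)
        if winner == "A":
            a_wins.append(f"{name} (+{a - b:.1f} pts)")
        elif winner == "B":
            b_wins.append(f"{name} (+{b - a:.1f} pts)")
    lines = []
    if a_wins:
        lines.append("Output A is stronger on: " + ", ".join(a_wins))
    if b_wins:
        lines.append("Output B is stronger on: " + ", ".join(b_wins))
    return lines


# Illustrative scores chosen to match the example gaps above.
scores = {
    "Correctness": (4.5, 3.0),
    "Completeness": (4.0, 3.5),
    "Clarity": (3.0, 4.0),
    "Format compliance": (3.0, 4.0),
}

for line in trade_off_report(scores):
    print(line)
```

Tied dimensions are left out of both lines, so the report only lists real differences.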
Ask the user which dimensions matter most for this use case. If the user provides a priority, recompute the totals with the adjusted weights and check whether the winner is unchanged; if it flips, say so in the recommendation.
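One way to run that re-weighting check is sketched below. The function names (`weighted_total`, `pick_winner`, `winner_is_stable`), the tolerance `tol`, and every weight and score value are assumptions made for illustration; the skill does not specify how the recomputation is implemented.

```python
def weighted_total(scores, weights):
    """Weight-normalised average of per-dimension scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[d] * scores[d] for d in weights) / total_weight


def pick_winner(scores_a, scores_b, weights, tol=1e-9):
    """Return "A", "B", or "Tie" based on the weighted totals."""
    total_a = weighted_total(scores_a, weights)
    total_b = weighted_total(scores_b, weights)
    if abs(total_a - total_b) <= tol:
        return "Tie"
    return "A" if total_a > total_b else "B"


def winner_is_stable(scores_a, scores_b, base_weights, adjusted_weights):
    """True if the winner is the same under both weightings."""
    return (pick_winner(scores_a, scores_b, base_weights)
            == pick_winner(scores_a, scores_b, adjusted_weights))


# Illustrative values only.
scores_a = {"Correctness": 4.5, "Completeness": 4.0,
            "Clarity": 3.0, "Format compliance": 3.0}
scores_b = {"Correctness": 3.0, "Completeness": 3.5,
            "Clarity": 4.0, "Format compliance": 4.0}
base = {"Correctness": 40, "Completeness": 20,
        "Clarity": 20, "Format compliance": 20}
# Hypothetical user priority: clarity and format matter most.
adjusted = {"Correctness": 20, "Completeness": 10,
            "Clarity": 35, "Format compliance": 35}

print(pick_winner(scores_a, scores_b, base))       # A
print(pick_winner(scores_a, scores_b, adjusted))   # B
print(winner_is_stable(scores_a, scores_b, base, adjusted))  # False
```

In this made-up case the winner flips from A to B once clarity and format dominate, which is exactly the kind of condition worth recording under the caveats.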
Write a concise recommendation:
## Recommendation: Output <A|B>
**Winner**: Output <A|B> (score: <X.X> vs <Y.Y>)
**Primary reasons**:
- <reason 1 — cite specific text from the winning output>
- <reason 2>
**Caveats**:
- <where the losing output does better, if relevant>
- <conditions under which the recommendation would flip>
**Suggested improvement for the winner**:
- <one concrete change that would make the winning output even better>
Before presenting, run a quick sanity check:
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin eval-framework
Implements LLM-as-judge techniques for evaluating LLM outputs via direct scoring, pairwise comparison, rubrics, and bias mitigation including position and length bias.
Builds production-grade LLM-as-judge evaluation systems: direct scoring, pairwise comparison, rubric calibration, bias mitigation, and confidence scoring.
Guides A/B testing, side-by-side comparisons, preference ranking, paired comparisons, and Elo ratings for evaluating AI outputs and detecting subtle quality differences missed by absolute scores.