From rz-ai-researcher
Design an evaluation for an AI capability — rubric, dataset, methodology, success criteria. Use when the ticket asks how to measure something (prompt quality, retrieval relevance, consolidation, etc.) without pre-selecting a specific technique.
How this skill is triggered — by the user, by Claude, or both
Slash command
/rz-ai-researcher:eval-specThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Invoke when the ticket:
Invoke when the ticket:
For comparing variants, use ablation-study (which uses an eval but has additional structure). For evaluating a specific named technique, use method-eval.
# Eval spec — {what we're measuring}
**Date:** {YYYY-MM-DD}
**Target app:** {app_id}
**Triggering ticket:** [CAR-{n}]({URL})
**Confidence:** {low/medium/high}
## What we're measuring
Plain-language statement. E.g., "Quality of SIA's consolidation output given N raw sources covering the same topic."
## Why this eval matters
What decision does this eval unblock? Without this eval, what risk remains?
## Rubric
### Dimension 1 — {name}
- **5:** {specific, observable behavior}
- **4:** {slightly below ideal}
- **3:** {acceptable baseline}
- **2:** {clear issue}
- **1:** {broken}
### Dimension 2 — {name}
- 5: …
- 3: …
- 1: …
### Dimension 3 — …
### Overall score
How dimensions compose into overall:
- **Gated:** Fail one = fail all. Use when a dimension is non-negotiable.
- **Weighted:** Weighted sum with weights summing to 100. Use when tradeoffs are acceptable.
Say which and why.
## Dataset
- **Source of inputs:** e.g., "50 sampled consolidation tasks from Langfuse traces, Mar–Apr 2026, stratified by source count"
- **Ground truth:** How determined. Options:
- Human-labeled (how many labelers? agreement threshold?)
- LLM-as-judge (which judge model? prompt template?)
- Synthetic / programmatic
- Held-out reference (if one exists)
- **Class balance / coverage:** Stratification across important dimensions (e.g., short vs. long inputs, English vs. other, etc.)
- **Size:** Starting N; plan for growth.
## Methodology
### Judge model (if LLM-as-judge)
- **Model:** {Opus / Sonnet / other}
- **Prompt:** (paste full prompt or link to Langfuse prompt ID)
- **Calibration:** Human vs. judge agreement check on a golden subset ({N} samples)
### Statistical treatment
- Confidence intervals (bootstrap, assumption)
- Significance thresholds for comparing variants (e.g., p < 0.05)
- Multiple comparison correction if needed
### Reproducibility
- Seed control
- Eval infrastructure (where it runs, how to rerun)
## Success criteria
- **Absolute thresholds:** e.g., "overall ≥ 4.0 across the dataset"
- **Relative thresholds** (for comparing variants): "new variant must beat baseline by ≥ 0.3 overall with p < 0.05"
- **Minimum coverage:** what fraction of inputs must score ≥ threshold
## Failure modes to watch for
- **Judge drift:** Same judge scoring differently over time or for similar inputs
- **Dataset bias:** Eval set doesn't represent production distribution
- **Overfitting to eval:** Developers tune to the eval set, regressing on held-out real inputs
- **Gaming the rubric:** A variant scores well but doesn't solve the actual problem (specification gaming)
## Calibration plan
Before putting this eval into production gating:
1. Human-label {N} golden samples
2. Run judge on same samples
3. Measure inter-rater reliability + judge-human agreement
4. Target: {Cohen's kappa ≥ 0.7} or equivalent
Recalibrate every {quarter / as drift is detected}.
## Implementation handoff
AI Engineer ticket scope: {bullet list of what to build}.
## Confidence: {low/medium/high}
npx claudepluginhub rczamor/rz-agent-team --plugin rz-ai-researcherCreates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.