From agent-patterns
Evaluate the output quality of an agent or pipeline run. Use this skill when asked to "review this agent output", "score this result", "evaluate agent quality", or "suggest improvements" to an agent's response or pipeline output.
Slash command
/agent-patterns:agent-review
Evaluate the quality of an agent or pipeline output against defined criteria. Produce a structured review with a numeric score and specific improvement suggestions.
Before scoring, identify which dimensions apply to this output:
| Dimension | Description | Applicable? |
|---|---|---|
| Correctness | Output matches the expected answer or solves the problem | Always |
| Completeness | All required sub-tasks or fields are addressed | Always |
| Format compliance | Output matches the required format (JSON, markdown, etc.) | If format specified |
| Conciseness | No unnecessary verbosity or repetition | Always |
| Safety | No harmful, biased, or policy-violating content | Always |
| Tool use quality | Tools called correctly with valid arguments | If tools were used |
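The applicability column above can be sketched as a small lookup. This is only an illustration: the `Dimension` class, the condition strings, and the `applicable_dimensions` helper are hypothetical names and are not part of the skill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    # Hypothetical condition keys mirroring the "Applicable?" column:
    # "always", "format_specified", or "tools_used".
    condition: str


DIMENSIONS = [
    Dimension("Correctness", "always"),
    Dimension("Completeness", "always"),
    Dimension("Format compliance", "format_specified"),
    Dimension("Conciseness", "always"),
    Dimension("Safety", "always"),
    Dimension("Tool use quality", "tools_used"),
]


def applicable_dimensions(format_specified: bool, tools_used: bool) -> list[str]:
    """Return the names of the dimensions that apply to one output."""
    flags = {
        "always": True,
        "format_specified": format_specified,
        "tools_used": tools_used,
    }
    return [d.name for d in DIMENSIONS if flags[d.condition]]
```

For an output with no required format and no tool calls, `applicable_dimensions(False, False)` leaves only the four dimensions marked "Always".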
Rate each applicable dimension on a scale of 1-5:
1 = Failing (major problems)
2 = Poor (significant issues)
3 = Acceptable (meets minimum bar)
4 = Good (minor issues only)
5 = Excellent (no issues)
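A reviewer that records scores programmatically might check them against this scale before going further. The sketch below assumes scores are whole numbers; `SCALE` and `label_scores` are hypothetical names chosen for illustration.

```python
# Labels copied from the 1-5 scale above.
SCALE = {
    1: "Failing",
    2: "Poor",
    3: "Acceptable",
    4: "Good",
    5: "Excellent",
}


def label_scores(scores: dict[str, int]) -> dict[str, str]:
    """Validate each score against the 1-5 scale and attach its label."""
    labelled = {}
    for dimension, score in scores.items():
        # Reject booleans explicitly, since bool is a subclass of int in Python.
        if isinstance(score, bool) or not isinstance(score, int) or score not in SCALE:
            raise ValueError(f"{dimension}: score must be an integer from 1 to 5, got {score!r}")
        labelled[dimension] = f"{score} = {SCALE[score]}"
    return labelled
```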
For each dimension scored below 4, list concrete issues:
Format:
Issue: <dimension>
Found: "<exact quote from output>"
Problem: <why this is wrong>
Fix: <specific improvement>
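One way to keep issue entries consistent is to render them from a small record type. The sketch below follows the four-field format above; the `Issue` class and the `dimensions_needing_issues` helper are hypothetical and only illustrate the "scored below 4" rule.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    dimension: str  # which dimension the issue belongs to
    found: str      # exact quote from the output under review
    problem: str    # why the quoted text is wrong
    fix: str        # specific improvement

    def render(self) -> str:
        """Format the issue using the four-line layout from the skill."""
        return "\n".join([
            f"Issue: {self.dimension}",
            f'Found: "{self.found}"',
            f"Problem: {self.problem}",
            f"Fix: {self.fix}",
        ])


def dimensions_needing_issues(scores: dict[str, int]) -> list[str]:
    """Return the dimensions scored below 4, which require listed issues."""
    return [name for name, score in scores.items() if score < 4]
```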
Calculate the overall score as a weighted average of the applicable dimension scores, rounded to one decimal place so that it falls into exactly one band below. Apply the verdict that matches the overall score:
| Score | Verdict |
|---|---|
| 4.5 - 5.0 | EXCELLENT -- ready to use |
| 3.5 - 4.4 | GOOD -- minor improvements recommended |
| 2.5 - 3.4 | ACCEPTABLE -- improvements needed before production use |
| 1.5 - 2.4 | POOR -- significant rework required |
| 1.0 - 1.4 | FAILING -- output should be discarded and regenerated |
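The scoring and verdict lookup can be sketched as follows. The skill does not state the weights, so this sketch assumes equal weights unless the caller supplies them, and it assumes the score is rounded to one decimal place to match the band boundaries. The function names are hypothetical.

```python
from typing import Optional

# Lower bound of each band from the verdict table, highest first.
VERDICTS = [
    (4.5, "EXCELLENT -- ready to use"),
    (3.5, "GOOD -- minor improvements recommended"),
    (2.5, "ACCEPTABLE -- improvements needed before production use"),
    (1.5, "POOR -- significant rework required"),
    (1.0, "FAILING -- output should be discarded and regenerated"),
]


def overall_score(scores: dict[str, int],
                  weights: Optional[dict[str, float]] = None) -> float:
    """Weighted average of dimension scores, rounded to one decimal place.

    Assumption: when no weights are given, every dimension counts equally.
    """
    if not scores:
        raise ValueError("at least one dimension score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return round(weighted_sum / total_weight, 1)


def verdict(score: float) -> str:
    """Map an overall score in the range 1.0-5.0 to its verdict band."""
    if not 1.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    for lower_bound, label in VERDICTS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every score >= 1.0 matches a band")
```

A review would typically call `verdict(overall_score(scores))` once every applicable dimension has been rated.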
List 1-3 actionable improvements in priority order:
For each suggestion, include:
npx claudepluginhub ats-kinoshita-iso/agent-workshop --plugin agent-patterns