From ecc
Self-rates agent output on 5 axes (accuracy, completeness, clarity, actionability, conciseness) with concrete evidence per criterion, producing a structured 1-5 scorecard with improvement suggestions.
Slash command: /ecc:agent-self-evaluation
After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate. It is a deliberate reflection step that catches omissions, flags overconfidence, and surfaces areas for improvement before the user has to.
See references/hook-integration.md.
| Axis | Question | What it catches |
|---|---|---|
| Accuracy | Are the facts, claims, and outputs correct? | Hallucinations, wrong API names, incorrect syntax, false statements |
| Completeness | Did it cover everything the user asked for? | Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks |
| Clarity | Is the explanation understandable and well-structured? | Confusing explanations, jargon without definition, missing context, rambling |
| Actionability | Can the user act on the output immediately? | Vague suggestions, missing steps, "you should X" without showing how, no verification path |
| Conciseness | Did it use the minimum words/tokens needed? | Redundancy, over-explanation, repeating the user's question verbatim, filler content |
5 — Exceptional: no reasonable improvement possible
4 — Good: minor nits only, no substantive gaps
3 — Adequate: meets the request but has a notable weakness on at least one axis
2 — Weak: has a clear gap that affects usability or correctness
1 — Poor: fundamentally misses the request or contains significant errors
Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: "Show the gap, don't just name it."
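The scale and the evidence rule above could be checked mechanically. The following is a minimal sketch, not part of the skill itself: `AxisRating`, `validate`, and the `VAGUE_PHRASES` list are hypothetical names introduced for illustration, and the sketch asks for evidence on every score, including a 5.

```python
from dataclasses import dataclass

AXES = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")

# Hypothetical examples of "evidence" that names a verdict without showing anything.
VAGUE_PHRASES = ("could be better", "all good", "everything covered", "clear")


@dataclass(frozen=True)
class AxisRating:
    axis: str
    score: int
    evidence: str


def validate(rating: AxisRating) -> list[str]:
    """Return a list of problems with one rating; an empty list means it passes."""
    problems = []
    if rating.axis not in AXES:
        problems.append(f"unknown axis: {rating.axis}")
    if not 1 <= rating.score <= 5:
        problems.append(f"score {rating.score} is outside 1-5")
    # Normalize so "Could be better." and "could be better" compare equal.
    text = rating.evidence.strip().lower().rstrip(".")
    if not text:
        problems.append("no evidence cited")
    elif text in VAGUE_PHRASES:
        problems.append("evidence names no concrete detail")
    return problems


print(validate(AxisRating("Completeness", 3, "Could be better.")))
print(validate(AxisRating("Completeness", 4, "Missing: timeout handling for hung connections.")))
```

A phrase list like this only catches the most obvious cases; judging whether evidence is specific enough remains a reading task.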
Gather what you'll evaluate:
- The original user request (read back from conversation)
- Your final response/output (the deliverable)
- Any tool outputs that verify correctness (test results, exit codes, lint output)
- Any user feedback received during the task (corrections, "try again", "that's not right")
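One way to keep the four inputs listed above together is a small record type. This is an illustrative sketch only: the `EvaluationInputs` name and its fields are assumptions, as is the choice to treat the request and the deliverable as required while tool outputs and user feedback may be empty.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationInputs:
    request: str                      # the original user request
    deliverable: str                  # the final response/output
    tool_outputs: list[str] = field(default_factory=list)   # test results, exit codes, lint output
    user_feedback: list[str] = field(default_factory=list)  # corrections received during the task

    def missing(self) -> list[str]:
        """Names of required inputs that are empty or whitespace-only."""
        required = {"request": self.request, "deliverable": self.deliverable}
        return [name for name, value in required.items() if not value.strip()]


inputs = EvaluationInputs(request="Add retry logic to HTTP client", deliverable="")
print(inputs.missing())
```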
Work through the 5 axes one at a time. For each:
Do NOT average the scores in your head first and then work backwards. Score each axis fresh.
Use the template from templates/evaluation-report.md. The report must include:
- One-line summary
- 5-axis scorecard (score + evidence per axis)
- Overall score (simple average, rounded to 1 decimal)
- 1-3 specific improvements ranked by impact
- Self-check: "Would the user agree with this assessment?"
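The report fields listed above can be modelled roughly as follows. The `Report` class, its field names, and the plain-text layout it prints are assumptions made for this sketch; they do not reproduce templates/evaluation-report.md.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical container for the report fields; not the real template."""
    summary: str
    scores: dict[str, tuple[int, str]]   # axis -> (score, evidence)
    improvements: list[str]              # assumed already ranked, highest impact first
    user_would_agree: bool               # answer to the self-check question

    def overall(self) -> float:
        # Simple average of the axis scores, rounded to 1 decimal.
        values = [score for score, _ in self.scores.values()]
        return round(sum(values) / len(values), 1)

    def render(self) -> str:
        lines = [f"Summary: {self.summary}", "Scorecard:"]
        for axis, (score, evidence) in self.scores.items():
            lines.append(f"  {axis}: {score} - {evidence}")
        lines.append(f"Overall: {self.overall()}")
        # The report asks for 1-3 improvements, so anything past the third is dropped.
        for rank, item in enumerate(self.improvements[:3], start=1):
            lines.append(f"Improvement {rank}: {item}")
        agree = "yes" if self.user_would_agree else "no"
        lines.append(f"Self-check (user would agree): {agree}")
        return "\n".join(lines)


report = Report(
    summary="Added retry logic to HTTP client",
    scores={
        "Accuracy": (5, "Retries use exponential backoff; no hallucinated methods."),
        "Completeness": (4, "Missing timeout handling for hung connections."),
        "Clarity": (5, "Code comments explain the backoff formula."),
        "Actionability": (5, "Single merge; tests pass."),
        "Conciseness": (4, "Retry loop could be extracted into a helper."),
    },
    improvements=["Add timeout handling for hung connections."],
    user_would_agree=True,
)
print(report.render())
```

With five integer scores the average is always a multiple of 0.2, so rounding to one decimal never discards information here.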
If any axis scored 3 or below:
Task: Add retry logic to HTTP client
Scorecard:
  Accuracy:      5 — All API calls correct. Verified: retries use
                     exponential backoff. No hallucinated methods.
  Completeness:  4 — Covered happy path + 3 error cases. Missing:
                     timeout handling for hung connections.
  Clarity:       5 — Code comments explain backoff formula.
                     PR description links to incident that motivated this.
  Actionability: 5 — Single merge. No follow-up tasks. Tests pass.
  Conciseness:   4 — 47 lines total. The retry loop could be extracted
                     into a helper to drop ~8 lines.
Overall: 4.6 — One gap (timeout handling). Fix before merging.
Task: Add retry logic to HTTP client
Scorecard:
  Accuracy:      2 — Used urllib3 which doesn't match our
                     httpx-based codebase. Wrong library.
  Completeness:  3 — Works for GET. POST/PUT not handled (user
                     said "all HTTP requests").
  Clarity:       4 — Code is readable. Good variable names.
  Actionability: 2 — "Add tests" mentioned but no test file created.
                     User has to write tests before merging.
  Conciseness:   3 — 120 lines. The retry config is duplicated in
                     3 places instead of one shared RetryConfig object.
Overall: 2.8 — Wrong library used. Needs httpx rewrite.
         Fix accuracy first (switch to httpx), then extend to all
         HTTP methods, then consolidate config.
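One possible way to turn a weak scorecard into a fix order is to list the axes at 3 or below, lowest score first. The sketch below uses the scores from the example above; the `triage` helper and its tie-breaking rule (rubric order) are assumptions for illustration, not something the skill prescribes.

```python
AXIS_ORDER = ("Accuracy", "Completeness", "Clarity", "Actionability", "Conciseness")


def triage(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return axes scoring at or below threshold, worst first.

    sorted() is stable, so axes with equal scores keep rubric order.
    """
    flagged = [axis for axis in AXIS_ORDER if scores[axis] <= threshold]
    return sorted(flagged, key=lambda axis: scores[axis])


weak = {
    "Accuracy": 2,
    "Completeness": 3,
    "Clarity": 4,
    "Actionability": 2,
    "Conciseness": 3,
}
overall = round(sum(weak.values()) / len(weak), 1)
print(overall, triage(weak))
```

Here Accuracy comes out first, which matches the example's advice to fix the wrong library before anything else.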
FAIL: Accuracy: 5 — All good.
Completeness: 5 — Everything covered.
Clarity: 5 — Clear.
No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.
FAIL: Completeness: 2 — Didn't handle WebSocket connections or
gRPC streaming (user didn't ask for these)
Only evaluate against what the user actually requested, not what you could have additionally built.
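A small sketch of that scoping rule, assuming each gap is recorded alongside the requirement it relates to. The `Gap` type, the `in_scope` helper, and the requirement labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    requirement: str   # the requirement this gap relates to
    detail: str        # what is missing or wrong


def in_scope(gaps: list[Gap], requested: set[str]) -> list[Gap]:
    """Keep only gaps tied to something the user actually asked for."""
    return [gap for gap in gaps if gap.requirement in requested]


requested = {"retry for all HTTP requests"}
gaps = [
    Gap("retry for all HTTP requests", "POST/PUT not handled"),
    Gap("WebSocket support", "no retry on WebSocket connections"),
    Gap("gRPC streaming", "no retry on gRPC streams"),
]
counted = in_scope(gaps, requested)
print([gap.detail for gap in counted])
```

Only the POST/PUT gap should lower the Completeness score; the other two describe work the user never requested.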
FAIL: "As I said earlier, this approach is wrong. Score: 1"
The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.
FAIL: "Score: 3. I don't like Python decorators."
"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.
- agent-eval: Head-to-head comparison of different coding agents on benchmark tasks
- verification-loop: Systematic verification of outputs against expected results
- security-review: Security-focused code review checklist
npx claudepluginhub affaan-m/ecc --plugin ecc