Skill

agent-self-evaluation

From ecc

Self-rates agent output on 5 axes (accuracy, completeness, clarity, actionability, conciseness) with concrete evidence per criterion, producing a structured 1-5 scorecard with improvement suggestions.

developer-tools

Popularity

Stars

217,344

Forks

33,361

Shared by

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ecc:agent-self-evaluation

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

After completing a complex task, the agent pauses to rate its own output against a structured 5-axis rubric. This is NOT a pass/fail gate — it's a deliberate reflection step that catches omissions, flags overconfidence, and surface areas for improvement before the user has to.

Supporting Files

examples/high-score-example.mdexamples/low-score-example.mdreferences/evaluation-criteria.mdreferences/hook-integration.mdscripts/evaluate.pytemplates/evaluation-report.md

SKILL.md

183 lines · ~1.9k tokens

Stats

LanguageJavaScript

Stars217,344

Forks33,361

MaintenanceExcellent

Last CommitJun 17, 2026

Actions

View Source View Plugin View on GitHub View README

Agent Self-Evaluation

When to Activate

After writing code that spans 3+ files or 50+ lines
After completing a multi-step workflow (implement → test → review)
After a debugging session that involved 3+ attempts
After producing a design document, architecture decision, or written analysis
When the user asks "how good was that?" or "rate yourself"
At the end of any session Stop hook (if configured — see references/hook-integration.md)

Core Concepts

The 5 Evaluation Axes

Axis	Question	What it catches
Accuracy	Are the facts, claims, and outputs correct?	Hallucinations, wrong API names, incorrect syntax, false statements
Completeness	Did it cover everything the user asked for?	Missed edge cases, unhandled error paths, forgotten requirements, skipped subtasks
Clarity	Is the explanation understandable and well-structured?	Confusing explanations, jargon without definition, missing context, rambling
Actionability	Can the user act on the output immediately?	Vague suggestions, missing steps, "you should X" without showing how, no verification path
Conciseness	Did it use the minimum words/tokens needed?	Redundancy, over-explanation, repeating the user's question verbatim, filler content

Scoring Scale

5 — Exceptional: no reasonable improvement possible
4 — Good: minor nits only, no substantive gaps
3 — Adequate: meets the request but has a notable weakness on at least one axis
2 — Weak: has a clear gap that affects usability or correctness
1 — Poor: fundamentally misses the request or contains significant errors

The Evidence Rule

Every score below 5 MUST cite specific evidence. A score of 3 cannot just say "could be better" — it must say exactly what is missing or wrong. The mantra: "Show the gap, don't just name it."

Workflow

Step 1: Collect the Raw Material

Gather what you'll evaluate:

- The original user request (read back from conversation)
- Your final response/output (the deliverable)
- Any tool outputs that verify correctness (test results, exit codes, lint output)
- Any user feedback received during the task (corrections, "try again", "that's not right")

Step 2: Score Each Axis Independently

Work through the 5 axes one at a time. For each:

Read the axis question
Find evidence (or lack of evidence) in the output
Assign a score 1-5
If score < 5, write a one-sentence improvement note citing the gap

Do NOT average the scores in your head first and then work backwards. Score each axis fresh.

Step 3: Produce the Evaluation Report

Use the template from templates/evaluation-report.md. The report must include:

- One-line summary
- 5-axis scorecard (score + evidence per axis)
- Overall score (simple average, rounded to 1 decimal)
- 1-3 specific improvements ranked by impact
- Self-check: "Would the user agree with this assessment?"

Step 4: Apply the Improvement

If any axis scored 3 or below:

State what you would do differently
If the gap is fixable in < 30 seconds (missing link, unclear phrasing), fix it now
If the gap requires rework, flag it explicitly: "This axis scored [reason] because [evidence]. Re-running with [specific fix] would likely raise it to [score]."

Code Examples

Example: Good Evaluation (Score 4+)

Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:    5 — All API calls correct. Verified: retries use
                  exponential backoff. No hallucinated methods.
  Completeness: 4 — Covered happy path + 3 error cases. Missing:
                  timeout handling for hung connections.
  Clarity:      5 — Code comments explain backoff formula.
                  PR description links to incident that motivated this.
  Actionability:5 — Single merge. No follow-up tasks. Tests pass.
  Conciseness:  4 — 47 lines total. The retry loop could be extracted
                  into a helper to drop ~8 lines.

Overall: 4.6 — One gap (timeout handling). Fix before merging.

Example: Weak Evaluation (Score 2-3)

Task: Add retry logic to HTTP client

Scorecard:
  Accuracy:    2 — Used urllib3 which doesn't match our
                  httpx-based codebase. Wrong library.
  Completeness: 3 — Works for GET. POST/PUT not handled (user
                  said "all HTTP requests").
  Clarity:      4 — Code is readable. Good variable names.
  Actionability:2 — "Add tests" mentioned but no test file created.
                  User has to write tests before merging.
  Conciseness:  3 — 120 lines. The retry config is duplicated in
                  3 places instead of one shared RetryConfig object.

Overall: 2.8 — Wrong library used. Needs httpx rewrite.
  Fix accuracy first (switch to httpx), then extend to all
  HTTP methods, then consolidate config.

Anti-Patterns

"Everything is a 5"

FAIL: Accuracy:    5 — All good.
   Completeness: 5 — Everything covered.
   Clarity:      5 — Clear.

No evidence cited. This is self-congratulation, not evaluation. A real 5 requires proving there's nothing to improve.

Over-penalizing for scope creep

FAIL: Completeness: 2 — Didn't handle WebSocket connections or
   gRPC streaming (user didn't ask for these)

Only evaluate against what the user actually requested, not what you could have additionally built.

Using the evaluation to re-litigate

FAIL: "As I said earlier, this approach is wrong. Score: 1"

The evaluation is about the delivered output, not about re-arguing design decisions that were already made. If the approach was wrong, that should have been caught before delivery.

Mixing personal preference with objective gaps

FAIL: "Score: 3. I don't like Python decorators."

"Don't like" is not evidence. Cite a concrete readability, testability, or correctness concern, or leave the score at 4+.

Best Practices

Evaluate the output, not the process. The user cares about what you delivered, not how many iterations you took.
One improvement per weak axis. Don't list 5 things for one axis — pick the highest-impact gap.
Tie improvements to user impact. "Missing error handling means the user's API call will crash silently" beats "add error handling."
Be specific about what 'fixed' looks like. "Re-run with httpx transport configured for retries" beats "fix the library issue."
Use tool outputs as evidence. If tests passed, cite them. If lint is clean, cite it. Don't guess — grep for the proof.
If you can't find any gaps, try harder. A perfect score across all 5 axes is rare. Ask: "If I were the user, what would annoy me about this output?"

Related Skills

agent-eval — Head-to-head comparison of different coding agents on benchmark tasks
verification-loop — Systematic verification of outputs against expected results
security-review — Security-focused code review checklist

agent-self-evaluation

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

agent-self-evaluation

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Agent Self-Evaluation

When to Activate

Core Concepts

The 5 Evaluation Axes

Scoring Scale

The Evidence Rule

Workflow

Step 1: Collect the Raw Material

Step 2: Score Each Axis Independently

Step 3: Produce the Evaluation Report

Step 4: Apply the Improvement

Code Examples

Example: Good Evaluation (Score 4+)

Example: Weak Evaluation (Score 2-3)

Anti-Patterns

"Everything is a 5"

Over-penalizing for scope creep

Using the evaluation to re-litigate

Mixing personal preference with objective gaps

Best Practices

Related Skills

Similar Skills

Agent Self-Evaluation

When to Activate

Core Concepts

The 5 Evaluation Axes

Scoring Scale

The Evidence Rule

Workflow

Step 1: Collect the Raw Material

Step 2: Score Each Axis Independently

Step 3: Produce the Evaluation Report

Step 4: Apply the Improvement

Code Examples

Example: Good Evaluation (Score 4+)

Example: Weak Evaluation (Score 2-3)

Anti-Patterns

"Everything is a 5"

Over-penalizing for scope creep

Using the evaluation to re-litigate

Mixing personal preference with objective gaps

Best Practices

Related Skills

Similar Skills