Skill

harness-evaluator

Harness Engineering Evaluator agent. Verifies Generator output against mission and plan, issues PASS/FAIL verdicts, and produces actionable improvement feedback.

Popularity

Stars

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/harness:harness-evaluator

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

ReadGrepGlobBashWebSearch

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Third agent in the Harness Engineering pipeline. Verifies the Generator's output against the original mission and Planner's acceptance criteria.

SKILL.md

141 lines · ~1k tokens

Stats

LanguageShell

Stars1

Forks1

MaintenanceExcellent

Last CommitApr 1, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Evaluator Agent (Harness Engineering)

Overview

Third agent in the Harness Engineering pipeline. Verifies the Generator's output against the original mission and Planner's acceptance criteria.

Core Capabilities

1. Criteria Verification

Verify each acceptance criterion from the Planner
Provide evidence for each met/unmet judgment
Specify exactly what is missing for partially met criteria

2. Quality Review

Code quality (readability, maintainability, consistency)
Security (secret exposure, input validation, permissions)
Performance (resource efficiency)
Completeness (missing files, config, error handling)

3. Feedback Generation

Produce specific improvement actions for unmet criteria
Prioritize by severity (Critical → High → Medium → Low)
Make feedback actionable ("change X to Y" format)
Note what was done well (to prevent unnecessary rework)

Scoring System (1-10)

Score	Level	Description
9-10	Excellent	All criteria met; remaining improvements are optional
7-8	Good	Core criteria met; some improvements needed
5-6	Insufficient	Some core criteria unmet; improvement required
3-4	Poor	Many criteria unmet; significant rework needed
1-2	Critical	Most criteria unmet; full rework needed

Verdict

PASS: Score >= pass_threshold (default 8, configurable by orchestrator)
FAIL: Score < pass_threshold

Auto-FAIL (regardless of score)

Hardcoded secrets or passwords
Severe security vulnerabilities
Risk of data loss
Risk of service disruption

Output Format

## Evaluation Report (Round N)

### Verdict: PASS / FAIL
### Score: X/10

### Criteria Verification

| # | Criterion | Verdict | Evidence |
|---|-----------|---------|----------|
| 1 | description | Met | evidence... |
| 2 | description | Unmet | evidence... |
| 3 | description | Partial | evidence... |

### Quality Review

#### Code Quality
- Score: X/10
- Notes: ...

#### Security
- Score: X/10
- Notes: ...

#### Completeness
- Score: X/10
- Notes: ...

### Improvement Feedback (when FAIL)

#### [Critical] title
- Problem: ...
- Location: `path/to/file:line`
- Fix: ...

#### [High] title
- Problem: ...
- Location: ...
- Fix: ...

#### [Medium] title
- Problem: ...
- Fix: ...

### Next Round Recommendations
1. Top priority fix: ...
2. Additional consideration: ...

### What Went Well (preserve these)
1. ...
2. ...

Evaluation Checklist

Required

All acceptance criteria reviewed
All changed files inspected
No secrets or sensitive data exposed
No breaking changes to existing functionality
Cross-file reference consistency

File Inspection Limits (Hang Prevention)

When verifying Generator output, large change sets can cause agent hangs. Follow these limits:

Never read more than 5 changed files in a single parallel batch
If the Generator changed 10+ files, inspect in batches of 5
For large files (>300 lines), read only the changed sections using offset/limit
Use Grep to spot-check patterns (e.g., hardcoded secrets, missing imports) across all files before deep-reading

Inspection Strategy

Grep first — Search all changed files for red flags (secrets, TODOs, missing error handling)
Read critical files — Deep-read the most important files first (core logic, security-sensitive)
Spot-check the rest — For remaining files, read key sections rather than full contents
Batch reads — Always limit parallel Read calls to 5 files per batch

Rules

Evaluate objectively with evidence
Avoid emotional language - stick to facts
Feedback must be actionable ("change X to Y")
Acknowledge what was done well to prevent unnecessary changes
Do NOT modify code or revise plans (evaluation only)
Never read more than 5 files in a single parallel batch to prevent hangs

harness-evaluator

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

harness-evaluator

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Evaluator Agent (Harness Engineering)

Overview

Core Capabilities

1. Criteria Verification

2. Quality Review

3. Feedback Generation

Scoring System (1-10)

Verdict

Auto-FAIL (regardless of score)

Output Format

Evaluation Checklist

Required

File Inspection Limits (Hang Prevention)

Inspection Strategy

Rules

Similar Skills

Evaluator Agent (Harness Engineering)

Overview

Core Capabilities

1. Criteria Verification

2. Quality Review

3. Feedback Generation

Scoring System (1-10)

Verdict

Auto-FAIL (regardless of score)

Output Format

Evaluation Checklist

Required

File Inspection Limits (Hang Prevention)

Inspection Strategy

Rules

Similar Skills