Skill

aep-gen-eval

Reusable generator/evaluator pattern for honest artifact validation. Provides scoring framework, agent contracts, eval protocol, and findings format. Use directly for gen/eval loops or reference from other skills.

developer-tools

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agentic-development-workflow:gen-eval

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

A reusable design pattern for honest evaluation of agent-produced artifacts. Separates the agent that creates work (generator) from the agent that evaluates it (evaluator), because agents consistently praise their own work.

Supporting Files

references/agent-contracts.md
references/eval-protocol.md
references/findings-format.md
references/recovery-ladder.md
references/scoring-framework.md

SKILL.md

144 lines · ~2k tokens

Stats

Language: TypeScript

Stars: 8

Maintenance: Excellent

Last Commit: Jun 16, 2026

Generator/Evaluator Pattern

"When asked to evaluate work they've produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." — Anthropic, "Harness Design for Long-Running Application Development"

This skill is both a utility library and a standalone skill:

As a library: Other skills read the files in its references/ directory for scoring, prompts, protocol, and findings format.
As a standalone skill: Invoke directly to run a gen/eval loop on any artifact.

How Other Skills Use This

| Skill | What it uses | Reference files |
| --- | --- | --- |
| `/aep-build` Phase 5 | Scoring framework + eval protocol | `scoring-framework.md`, `eval-protocol.md` |
| `/aep-launch` | Dimension presets for brainstorming | `scoring-framework.md` (presets section) |
| `/aep-validate` | Agent prompts + findings format | `agent-contracts.md`, `findings-format.md`, `scoring-framework.md` |

Cross-skill reference paths

After sync with the aep- prefix, the reference files are located at:

.claude/skills/aep-gen-eval/references/scoring-framework.md
.claude/skills/aep-gen-eval/references/agent-contracts.md
.claude/skills/aep-gen-eval/references/eval-protocol.md
.claude/skills/aep-gen-eval/references/findings-format.md
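
A consuming skill could resolve these paths with a small helper rather than hard-coding each string. The sketch below is illustrative only: the directory and file names come from the list above, while `REFERENCES_DIR`, `ReferenceFile`, and `referencePath` are hypothetical names, not part of the skill.

```typescript
// Directory taken from the paths listed above; the helper names are hypothetical.
const REFERENCES_DIR = ".claude/skills/aep-gen-eval/references";

// Only the four files named in the cross-skill path list.
type ReferenceFile =
  | "scoring-framework.md"
  | "agent-contracts.md"
  | "eval-protocol.md"
  | "findings-format.md";

// Builds the synced path for one reference file.
function referencePath(file: ReferenceFile): string {
  return `${REFERENCES_DIR}/${file}`;
}

const scoringPath = referencePath("scoring-framework.md");
```

Typing the file name as a union means a misspelled reference fails at compile time instead of at read time.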

The Core Principle

Generator and evaluator must be separate agents. This is not optional — it is the single most impactful quality improvement in agentic workflows.

Why:

Agents tend not to evaluate their own work honestly (see the Anthropic article quoted above)
Self-evaluation produces inflated scores and rationalizes problems away
Separate evaluation catches issues the generator is blind to
The cost of a second agent is trivial compared to shipping broken work

Reference Files

Read these files for detailed specifications. Each file is self-contained.

| File | Contents | When to read |
| --- | --- | --- |
| `references/scoring-framework.md` | Dimension definitions (1-5 scale), hard failure thresholds, dimension presets (UI, API, security, data, mixed), few-shot examples, anti-patterns | Setting up evaluation criteria, scoring work, calibrating evaluators |
| `references/agent-contracts.md` | Generator/evaluator role separation, prompt templates (generator, evaluator, protocol checker), context assembly rules | Spawning evaluation agents, assembling prompts |
| `references/eval-protocol.md` | Eval request/response format, verification JSON schema, the eval loop (request → response → fix → re-evaluate), execution contexts (Task subagent, codex exec, tmux, workflow), the needs-human gate record | Running the evaluation loop, tracking verification state |
| `references/findings-format.md` | Severity categorization (blocking/important/minor), deduplication protocol, presentation format, changelog entry format | Consolidating findings from multiple agents, presenting results |

Standalone Usage

When invoked directly, this skill runs a gen/eval loop on any artifact.

Step 1: Identify the artifact

What is being evaluated? Options:

A document (product context, architecture, design doc)
Code changes (implementation, PR diff)
An OpenSpec change (proposal, design, specs, tasks)
A structured file (YAML, JSON config, migration plan)

Step 2: Choose execution mode

| Mode | Agents | When to use |
| --- | --- | --- |
| Parallel | Generator + Evaluator spawned simultaneously | Documents, designs, product context; agents work independently |
| Sequential | Generator first, then Evaluator reads generator's work | Code review; the evaluator needs to see the implementation |
| Loop | Generator → Evaluator → fix → repeat (max 5 rounds) | Active development; the generator can fix issues between rounds |
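
The Loop mode can be sketched as a bounded retry around two independent callbacks. This is a minimal illustration, not the protocol defined in `references/eval-protocol.md`: the `Finding`, `EvalResult`, and `LoopOutcome` shapes and the `runLoop` function are assumptions, and the callbacks are synchronous only to keep the sketch short (real agent calls would be asynchronous). The only value taken from the table is the five-round cap.

```typescript
// Hypothetical shapes; the real request/response format is defined in
// references/eval-protocol.md and is not reproduced here.
interface Finding {
  severity: "blocking" | "important" | "minor";
  message: string;
}

interface EvalResult {
  passed: boolean;
  findings: Finding[];
}

interface LoopOutcome {
  artifact: string;
  rounds: number;
  passed: boolean;
  findings: Finding[];
}

// Generator and evaluator are separate callbacks, so neither sees the
// other's internal state; they exchange only the artifact and the findings.
type Generate = (previous: string | null, findings: Finding[]) => string;
type Evaluate = (artifact: string) => EvalResult;

const MAX_ROUNDS = 5; // "max 5 rounds" from the Loop row of the table

function runLoop(generate: Generate, evaluate: Evaluate): LoopOutcome {
  let artifact = generate(null, []);
  for (let round = 1; ; round++) {
    const result = evaluate(artifact);
    // Stop on success, or when the round budget is spent.
    if (result.passed || round === MAX_ROUNDS) {
      return { artifact, rounds: round, passed: result.passed, findings: result.findings };
    }
    // Hand the findings back to the generator for a fix, then re-evaluate.
    artifact = generate(artifact, result.findings);
  }
}
```

When the budget runs out, the outcome still carries the last findings, so a caller can escalate to a human instead of silently accepting the artifact.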

Step 3: Configure dimensions

Read references/scoring-framework.md and select the appropriate preset:

Code: Completeness, Correctness, UX Quality, Security, Code Quality
Product/design: Completeness, Consistency, Implementability, Security, Downstream Compatibility
Documents: Accuracy, Executability, Completeness

Or define custom dimensions for the specific artifact.
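
As a rough illustration, the presets above can be held as plain data and checked against a set of scores. The dimension names are taken from the list above and the 1-5 scale from the reference-file summary. The `PRESETS` constant, the `failingDimensions` function, and the `hardFailureBelow` parameter are hypothetical; the actual hard failure thresholds are defined in `references/scoring-framework.md` and are not assumed here.

```typescript
type PresetName = "code" | "productDesign" | "documents";

// Dimension names copied from the presets listed above.
const PRESETS: Record<PresetName, readonly string[]> = {
  code: ["Completeness", "Correctness", "UX Quality", "Security", "Code Quality"],
  productDesign: [
    "Completeness",
    "Consistency",
    "Implementability",
    "Security",
    "Downstream Compatibility",
  ],
  documents: ["Accuracy", "Executability", "Completeness"],
};

type Scores = Record<string, number | undefined>;

// Returns the dimensions that are unscored, off the 1-5 scale, or below
// the caller-supplied threshold (an assumed parameter, not a documented value).
function failingDimensions(
  preset: PresetName,
  scores: Scores,
  hardFailureBelow: number,
): string[] {
  return PRESETS[preset].filter((dimension) => {
    const score = scores[dimension];
    return score === undefined || score < 1 || score > 5 || score < hardFailureBelow;
  });
}
```

Treating an unscored dimension as a failure keeps an evaluator from passing an artifact by skipping a dimension.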

Step 4: Spawn agents

Read references/agent-contracts.md for prompt templates. Customize the templates with:

The artifact content
The technical constraints
The verification checklist (what the evaluator should check against the codebase)
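
The three inputs above could be assembled into an evaluator prompt along these lines. This is only a sketch: the real templates are in `references/agent-contracts.md`, and the section headings, the `EvaluatorPromptInput` type, and the `buildEvaluatorPrompt` function are invented for illustration.

```typescript
// Hypothetical input shape mirroring the three customization points above.
interface EvaluatorPromptInput {
  artifact: string;
  constraints: string[];
  checklist: string[];
}

// Assembles a prompt from the inputs. The "##" section headings are
// placeholders, not the wording of the skill's own templates.
function buildEvaluatorPrompt(input: EvaluatorPromptInput): string {
  const bullets = (items: string[]): string =>
    items.map((item) => `- ${item}`).join("\n");

  return [
    "## Artifact",
    input.artifact,
    "## Technical constraints",
    bullets(input.constraints),
    "## Verification checklist",
    bullets(input.checklist),
  ].join("\n\n");
}
```

The prompt contains the artifact and the checklist but nothing about how the generator produced the work, which keeps the evaluator's context separate from the generator's.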

Step 5: Process results

Read references/findings-format.md for how to consolidate, categorize, and present findings. Apply fixes to the artifact.
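
Consolidation might look roughly like the sketch below. The three severity levels come from the reference-file summary. The deduplication key (location plus normalized message) and the "keep the most severe rating" rule are assumptions made for illustration; the actual protocol is specified in `references/findings-format.md`.

```typescript
type Severity = "blocking" | "important" | "minor";

// Hypothetical finding shape; `location` is an assumed field.
interface Finding {
  severity: Severity;
  location: string;
  message: string;
}

// Lower rank sorts first.
const SEVERITY_RANK: Record<Severity, number> = {
  blocking: 0,
  important: 1,
  minor: 2,
};

// Merges findings from several agents: duplicates collapse to one entry
// with the most severe rating, and the result is ordered by severity.
function consolidate(findings: Finding[]): Finding[] {
  const byKey = new Map<string, Finding>();
  for (const finding of findings) {
    // Assumed dedup key: same location and same message, ignoring case and
    // surrounding whitespace.
    const key = `${finding.location}::${finding.message.trim().toLowerCase()}`;
    const existing = byKey.get(key);
    if (
      existing === undefined ||
      SEVERITY_RANK[finding.severity] < SEVERITY_RANK[existing.severity]
    ) {
      byKey.set(key, finding);
    }
  }
  return Array.from(byKey.values()).sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity],
  );
}
```

An exact-match key is deliberately conservative: two agents describing the same issue in different words would remain as separate findings and need manual merging.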

Design Decisions

Why a utility skill, not just reference files:

A utility skill can be invoked directly (/aep-gen-eval) for ad-hoc validation
It appears in the skill list, making the pattern discoverable
It has its own description for triggering, so agents use it when appropriate
The references/ directory is still accessible to other skills via path

Why not merge with /aep-validate:

/aep-validate is a product-context skill with 4 specific modes (product, design, code, document)
The gen/eval pattern is more general — it applies to any evaluation scenario
/aep-validate consumes the gen/eval pattern; it is not the pattern itself

Why not keep in /aep-launch:

Launch only sets up criteria; it doesn't run the pattern
The scoring framework is consumed by build, validate, AND launch
Keeping it in launch creates a confusing ownership model

Next Step

After running gen/eval, proceed based on what was evaluated:

Product context → /aep-dispatch
Design artifacts → /aep-launch
Code → create PR or continue /aep-build
Documents → publish or share
