Modular three-agent eval-driven development harness implementing Anthropic's Planner-Generator-Evaluator architecture
Import real failure cases from bug reports, incidents, and manual tests to seed the eval suite
Load domain-specific evaluation criteria and grading weights for the project type
Initialize eval-driven development with Planner-Generator-Evaluator architecture
Run one sprint through the contract-build-eval cycle
Cross-sprint analysis showing pass rates, consistency metrics, trends, and failure patterns
Modifies files
Hook triggers on file write and edit operations
Uses power tools
Uses Bash, Write, or Edit tools
A modular three-agent eval-driven development harness for Claude Code. Implements Anthropic's Planner-Generator-Evaluator architecture as a portable, rubric-swappable plugin.
Three agents (Planner, Generator, and Evaluator) collaborate through files on disk, with no shared context.
The agents communicate exclusively through files in a .harness/ directory. Each sprint follows a contract→build→eval→retry cycle.
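The contract→build→eval→retry cycle can be pictured as a small loop. The sketch below is illustrative only: `run_sprint` and its three callables are hypothetical names, the real agents exchange files in .harness/ rather than in-memory values, and it assumes that `max_retries` counts retries after the first build attempt.

```python
from typing import Callable


def run_sprint(
    negotiate_contract: Callable[[], dict],
    build: Callable[[dict, int], None],
    evaluate: Callable[[dict], bool],
    max_retries: int = 3,
) -> tuple[bool, int]:
    """Run contract -> build -> eval, rebuilding after each failed eval.

    Returns (passed, attempts_used).
    Assumption: max_retries counts retries *after* the first attempt,
    so at most max_retries + 1 builds are made.
    """
    contract = negotiate_contract()  # agreed once per sprint
    for attempt in range(max_retries + 1):
        build(contract, attempt)  # Generator step (hypothetical callable)
        if evaluate(contract):  # Evaluator step (hypothetical callable)
            return True, attempt + 1
    return False, max_retries + 1


if __name__ == "__main__":
    # Stub evaluator that fails twice, then passes.
    outcomes = iter([False, False, True])
    passed, attempts = run_sprint(
        negotiate_contract=lambda: {"criteria": ["login works"]},
        build=lambda contract, attempt: None,
        evaluate=lambda contract: next(outcomes),
    )
    print(passed, attempts)
```

Passing the steps in as callables keeps the loop independent of how each agent is invoked, which mirrors the harness's idea that the agents share no context.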
Copy or clone this repository into your Claude Code plugins directory, or install from a target project:
```bash
# From your project directory
claude /plugin install /path/to/eval-harness
```
/harness-kickoff Build a task management API with user authentication and team workspaces
This will create a .harness/ directory with configuration.
/harness-sprint
Runs the next incomplete sprint through the full contract→build→eval→retry cycle.
To target a specific sprint:
/harness-sprint 3
/harness-summary
Generates a cross-sprint analysis with pass rates, trends, failure patterns, and recommendations.
The harness ships with four rubrics. Set the project type during kickoff or in .harness/config.json:
| Type | Rubric | Key Dimensions |
|---|---|---|
| web-app | web-app | Functionality, Visual Design, Code Quality, Robustness |
| rag-system | rag-system | Retrieval Quality, Answer Faithfulness, System Robustness, Architecture |
| cli-tool | cli-tool | Functionality, Usability, Error Handling, Code Quality |
| api-service | api-service | Correctness, Robustness, API Design, Code Quality |
.harness/config.json controls harness behavior:
```json
{
  "project_type": "web-app",
  "rubric": "web-app",
  "max_retries": 3,
  "pass_threshold": {
    "per_dimension_minimum": 2,
    "critical_dimensions": ["functionality"],
    "critical_minimum": 3
  },
  "contract_negotiation_rounds": 2,
  "git_checkpoint": true,
  "components_enabled": {
    "planner": true,
    "contract_negotiation": true,
    "sprint_decomposition": true,
    "eval_summary": true
  }
}
```
The components_enabled section lets you simplify the harness as models improve. For example, disabling contract_negotiation skips the Evaluator's contract review — the Generator's proposed criteria are accepted directly.
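One plausible reading of the `pass_threshold` block is sketched below. This is an assumption about the semantics, not the harness's documented logic: it treats every dimension as needing at least `per_dimension_minimum`, and every dimension listed in `critical_dimensions` as additionally needing `critical_minimum`. The function name `sprint_passes` and the lowercase dimension keys are hypothetical.

```python
def sprint_passes(scores: dict[str, int], threshold: dict) -> bool:
    """Check rubric scores against a pass_threshold block.

    Assumed semantics (not confirmed by the README):
      - every scored dimension must reach per_dimension_minimum
      - critical dimensions must also reach critical_minimum
      - a critical dimension with no score counts as a failure
    """
    minimum = threshold["per_dimension_minimum"]
    critical = {name.lower() for name in threshold["critical_dimensions"]}
    critical_minimum = threshold["critical_minimum"]

    normalized = {name.lower(): score for name, score in scores.items()}
    for name, score in normalized.items():
        required = max(minimum, critical_minimum) if name in critical else minimum
        if score < required:
            return False
    return critical.issubset(normalized)


if __name__ == "__main__":
    threshold = {
        "per_dimension_minimum": 2,
        "critical_dimensions": ["functionality"],
        "critical_minimum": 3,
    }
    scores = {
        "functionality": 3,
        "visual design": 2,
        "code quality": 2,
        "robustness": 2,
    }
    print(sprint_passes(scores, threshold))
```

Under this reading, a sprint scoring 2 on functionality would fail even though 2 meets the general minimum, because functionality is listed as critical.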
These optional fields extend .harness/config.json with backward-compatible defaults. A config that omits them runs exactly as in Phase 1.
thinking.profile: one of "default", "fast", or "thorough". Default: "default", which preserves Phase-1 behavior (no override is applied to the adaptive-thinking effort declared in each agent's frontmatter). The "fast" and "thorough" values are reserved for a future override dispatcher, so "default" is the only profile in effect today, and it changes nothing. Each role's effort level is declared in its own frontmatter (agents/planner.md, agents/generator.md, agents/evaluator.md, and the harness-summary skill): medium for routine planning and implementation, high for capability evaluation, and max for contract review and cross-sprint summary analysis.
batch.enabled: boolean. Default: false. When true and a sprint has at least batch.min_criteria criteria, eval verifications are submitted as a single Anthropic Batch API call (50% discount on input/output tokens, 24-hour SLA). Batch is a cost optimization, not a latency optimization. With the default false, evaluations run synchronously as in Phase 1.
batch.min_criteria: integer. Default: 20. Sprints with fewer criteria stay synchronous even when batch.enabled is true, because the batch overhead is only worth absorbing on large suites. The criterion count compared against this threshold is the same count emitted in sprint-{NN}.tasks.json (success criteria + Should-NOT gates).
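The defaults and the batch rule described above could be applied roughly as follows. This is a minimal sketch under two assumptions: that the dotted names map to nested JSON objects (`"batch": {"enabled": ...}`), and that the threshold comparison is inclusive ("at least"). `OPTIONAL_DEFAULTS`, `with_defaults`, and `use_batch` are hypothetical names, not part of the plugin.

```python
import copy

# Defaults taken from the field descriptions; the nesting is an assumption.
OPTIONAL_DEFAULTS = {
    "thinking": {"profile": "default"},
    "batch": {"enabled": False, "min_criteria": 20},
}


def with_defaults(config: dict) -> dict:
    """Return a copy of config with any missing optional fields filled in."""
    merged = copy.deepcopy(config)
    for section, defaults in OPTIONAL_DEFAULTS.items():
        section_values = merged.setdefault(section, {})
        for key, value in defaults.items():
            section_values.setdefault(key, value)
    return merged


def use_batch(config: dict, criteria_count: int) -> bool:
    """Decide whether a sprint's eval verifications go through a batch call.

    criteria_count is the total of success criteria plus Should-NOT gates.
    """
    batch = with_defaults(config)["batch"]
    return bool(batch["enabled"]) and criteria_count >= batch["min_criteria"]


if __name__ == "__main__":
    phase1_config = {"project_type": "web-app"}
    print(use_batch(phase1_config, 50))                 # batch disabled by default
    print(use_batch({"batch": {"enabled": True}}, 25))  # enabled, above threshold
```

Filling defaults on a copy leaves the loaded config untouched, so a file that omits the optional fields is never rewritten just by being read.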
npx claudepluginhub ats-kinoshita-iso/trine-eval --plugin trine-eval