trine-eval

Name: trine-eval
Author: ats-kinoshita-iso

Modular three-agent eval-driven development harness implementing Anthropic's Planner-Generator-Evaluator architecture

npx claudepluginhub ats-kinoshita-iso/trine-eval --plugin trine-eval

Popularity

Stars

Med: 0·Avg: 285

Installs

Med: 0·Avg: 1

What's Inside

Agents (3)

evaluator

/evaluator

Adversarial QA agent that tests sprint deliverables against contracts

generator

/generator

Implements one sprint at a time with contract negotiation, git commits, and self-review

planner

/planner

Expands user prompts into product specifications with sprint decomposition

Skills (6)

bootstrap-failures

/bootstrap-failures

Import real failure cases from bug reports, incidents, and manual tests to seed the eval suite

eval-rubric

/eval-rubric

Load domain-specific evaluation criteria and grading weights for the project type

harness-kickoff

/harness-kickoff

Initialize eval-driven development with Planner-Generator-Evaluator architecture

harness-sprint

/harness-sprint

Run one sprint through the contract-build-eval cycle

harness-summary

/harness-summary

Cross-sprint analysis showing pass rates, consistency metrics, trends, and failure patterns

Hooks (1)

Event Hooks

File writes

3 hooks across 3 events

Stats

Version: 0.3.3

Released: May 7, 2026

Language: Python

Stars: 0

Maintenance: Excellent

License: MIT

Last Commit: May 21, 2026

Added: Apr 12, 2026

Safety Signals

Caution

Modifies files

Hook triggers on file write and edit operations

Uses power tools

Uses Bash, Write, or Edit tools

README

trine-eval

A modular three-agent eval-driven development harness for Claude Code. Implements Anthropic's Planner-Generator-Evaluator architecture as a portable, rubric-swappable plugin.

How It Works

Three agents collaborate through files on disk — no shared context:

Planner — Expands a short user prompt into a product spec with sprint decomposition
Generator — Implements one sprint at a time, negotiating testable success criteria before writing code
Evaluator — Adversarially tests each sprint against its contract, grading PASS/FAIL with specific evidence

The agents communicate exclusively through files in a .harness/ directory. Each sprint follows a contract→build→eval→retry cycle.

Installation

Copy or clone this repository into your Claude Code plugins directory, or install from a target project:

# From your project directory
claude /plugin install /path/to/eval-harness

Usage

1. Initialize the Harness

/harness-kickoff Build a task management API with user authentication and team workspaces

This will:

Detect your project type (or ask you to choose)
Create the .harness/ directory with configuration
Run the Planner to produce a spec and sprint plan
Present the plan for your review

2. Run Sprints

/harness-sprint

Runs the next incomplete sprint through the full cycle:

Generator proposes a sprint contract with testable success criteria
Evaluator reviews the contract for testability and completeness
Generator implements the sprint
Evaluator tests against the contract
If failures: Generator fixes, Evaluator re-tests (up to 3 rounds)
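The build, evaluate, and retry part of this cycle can be sketched as a small loop. This is an illustrative sketch only, not the plugin's implementation: `run_sprint` and its `build`, `evaluate`, and `fix` callables are hypothetical names, and it assumes "up to 3 rounds" means at most three fix-and-retest rounds after the first evaluation.

```python
from typing import Callable, List


def run_sprint(
    build: Callable[[], None],
    evaluate: Callable[[], List[str]],
    fix: Callable[[List[str]], None],
    max_retries: int = 3,
) -> dict:
    """Build once, evaluate, then fix and re-test until passing or out of retries.

    `evaluate` returns the list of failed criteria; an empty list means PASS.
    All names here are hypothetical stand-ins for the Generator and Evaluator.
    """
    build()
    failures = evaluate()
    fix_rounds = 0
    # Assumption: max_retries bounds the number of fix rounds, not total evaluations.
    while failures and fix_rounds < max_retries:
        fix(failures)
        failures = evaluate()
        fix_rounds += 1
    return {"passed": not failures, "fix_rounds": fix_rounds, "failures": failures}
```

Under this reading, a sprint that still fails after the third fix round is reported as failed with its remaining failures attached, rather than looping indefinitely.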

To target a specific sprint:

/harness-sprint 3

3. View Progress

/harness-summary

Generates a cross-sprint analysis with pass rates, trends, failure patterns, and recommendations.

Project Types and Rubrics

The harness ships with four rubrics. Set the project type during kickoff or in .harness/config.json:

Type	Rubric	Key Dimensions
`web-app`	web-app	Functionality, Visual Design, Code Quality, Robustness
`rag-system`	rag-system	Retrieval Quality, Answer Faithfulness, System Robustness, Architecture
`cli-tool`	cli-tool	Functionality, Usability, Error Handling, Code Quality
`api-service`	api-service	Correctness, Robustness, API Design, Code Quality

Configuration

.harness/config.json controls harness behavior:

{
  "project_type": "web-app",
  "rubric": "web-app",
  "max_retries": 3,
  "pass_threshold": {
    "per_dimension_minimum": 2,
    "critical_dimensions": ["functionality"],
    "critical_minimum": 3
  },
  "contract_negotiation_rounds": 2,
  "git_checkpoint": true,
  "components_enabled": {
    "planner": true,
    "contract_negotiation": true,
    "sprint_decomposition": true,
    "eval_summary": true
  }
}
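One plausible reading of the pass_threshold block is sketched below. The README does not spell out the grading logic, so this rests on stated assumptions: scores are integers keyed by lowercase dimension name, every dimension must reach per_dimension_minimum, and critical dimensions must also reach critical_minimum. The function name `sprint_passes` is hypothetical.

```python
def sprint_passes(scores: dict, pass_threshold: dict) -> bool:
    """Return True if every dimension score meets its required minimum.

    Assumed semantics (not confirmed by the README): non-critical dimensions
    need per_dimension_minimum; critical ones need the higher of the two floors.
    """
    floor = pass_threshold["per_dimension_minimum"]
    critical = set(pass_threshold["critical_dimensions"])
    critical_floor = max(floor, pass_threshold["critical_minimum"])

    # Assumption: a critical dimension with no score counts as a failure.
    if not critical.issubset(scores):
        return False

    for dimension, score in scores.items():
        required = critical_floor if dimension in critical else floor
        if score < required:
            return False
    return True
```

With the example config, a sprint scoring 3 on functionality and 2 elsewhere would pass under this reading, while a sprint scoring 2 on functionality would not.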

Disabling Components

The components_enabled section lets you simplify the harness as models improve. For example, disabling contract_negotiation skips the Evaluator's contract review — the Generator's proposed criteria are accepted directly.
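The effect of the contract_negotiation flag can be illustrated with a short sketch. The step labels and the `sprint_steps` function are hypothetical, and treating a missing key as enabled is an assumption, not documented behavior.

```python
def sprint_steps(components_enabled: dict) -> list:
    """List the sprint steps that would run for a given components_enabled map.

    Step names are illustrative labels, not identifiers used by the plugin.
    """
    steps = ["propose_contract"]
    # Assumption: a missing key is treated as enabled.
    if components_enabled.get("contract_negotiation", True):
        steps.append("review_contract")
    steps += ["implement", "evaluate"]
    return steps
```

With contract_negotiation set to false, the review step drops out and the proposed criteria flow straight into implementation.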

Phase 2 Configuration Knobs

These optional fields extend .harness/config.json with backward-compatible defaults. A config that omits them runs exactly as in Phase 1.

thinking.profile: one of "default", "fast", or "thorough". Default: "default", which preserves Phase 1 behavior: no override is applied to the adaptive-thinking effort declared in each agent's frontmatter. The "fast" and "thorough" values are reserved for a future override dispatcher; "default" is the only value that takes effect today, and it changes nothing. Each role's effort level is declared in its own frontmatter (agents/planner.md, agents/generator.md, agents/evaluator.md, and the harness-summary skill): medium for routine planning and implementation, high for capability evaluation, and max for contract review and cross-sprint summary analysis.
batch.enabled: boolean. Default: false. When true and a sprint has at least batch.min_criteria criteria, eval verifications are submitted as a single Anthropic Batch API call (50% discount on input and output tokens, 24-hour SLA). Batching is a cost optimization, not a latency optimization. With the default of false, evaluations run synchronously as in Phase 1.
batch.min_criteria: integer. Default: 20. Sprints with fewer criteria stay synchronous even when batch.enabled is true, because the batch overhead is only worth absorbing on large suites. The criterion count compared against this threshold is the same count emitted in sprint-{NN}.tasks.json (success criteria plus Should-NOT gates).
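The batching rule described by these two knobs can be sketched as follows. This is a minimal illustration, assuming the dotted names map to a nested "batch" object in .harness/config.json; the function name `should_batch` is hypothetical.

```python
def should_batch(config: dict, criteria_count: int) -> bool:
    """Decide whether a sprint's eval verifications would go through a batch call.

    Assumption: batch.enabled and batch.min_criteria live under a nested
    "batch" object. Missing fields fall back to the documented defaults.
    """
    batch = config.get("batch", {})
    enabled = batch.get("enabled", False)        # documented default: false
    min_criteria = batch.get("min_criteria", 20)  # documented default: 20
    return bool(enabled) and criteria_count >= min_criteria
```

A config that omits the batch section always evaluates synchronously, which matches the backward-compatible default described above.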
