trine-eval: Planner-Generator-Evaluator harness for eval-driven development across web/RAG/CLI/API projects, eval-harness methodology audits (meta layer), and harness-build agent-runtime conformance (runtime layer)
Import real failure cases from bug reports, incidents, and manual tests to seed the eval suite
Load domain-specific evaluation criteria and grading weights for the project type
Initialize eval-driven development with Planner-Generator-Evaluator architecture
Run one sprint through the contract-build-eval cycle
Cross-sprint analysis showing pass rates, consistency metrics, trends, and failure patterns
Modifies files
Hook triggers on file write and edit operations
Uses power tools
Uses Bash, Write, or Edit tools
A modular three-agent eval-driven development harness for Claude Code. Implements Anthropic's Planner-Generator-Evaluator architecture as a portable, rubric-swappable plugin (plugins/trine-eval/), shipped alongside the trine_eval Python eval library (src/trine_eval/) that the harness builds and exercises.
Three agents collaborate through files on disk — no shared context:
The agents communicate exclusively through files in a .harness/ directory. Each sprint follows a contract→build→eval→retry cycle.
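The file-only handoff and the retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the file names (`contract.json`, `build-N.json`, `eval-N.json`), the `run_sprint` function, and the `generate`/`evaluate` callables are all hypothetical, and the reading of `max_retries` as "one initial attempt plus that many retries" is an assumption.

```python
import json
from pathlib import Path
from typing import Callable


def run_sprint(
    harness_dir: Path,
    generate: Callable[[dict, int], dict],
    evaluate: Callable[[dict, dict], bool],
    contract: dict,
    max_retries: int = 3,
) -> tuple[bool, int]:
    """Run contract -> build -> eval, retrying on failure.

    Returns (passed, attempts). All file names are hypothetical.
    """
    harness_dir.mkdir(parents=True, exist_ok=True)
    contract_path = harness_dir / "contract.json"
    contract_path.write_text(json.dumps(contract))

    # Assumed semantics: one initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        # The generator sees only what is on disk, never the evaluator's context.
        build = generate(json.loads(contract_path.read_text()), attempt)
        build_path = harness_dir / f"build-{attempt}.json"
        build_path.write_text(json.dumps(build))

        # The evaluator likewise re-reads the contract and the build from disk.
        passed = evaluate(
            json.loads(contract_path.read_text()),
            json.loads(build_path.read_text()),
        )
        result = {"attempt": attempt, "passed": passed}
        (harness_dir / f"eval-{attempt}.json").write_text(json.dumps(result))
        if passed:
            return True, attempt
    return False, max_retries + 1
```

Passing state through files rather than function arguments is what keeps each role's context isolated in this sketch: either side could be replaced by a separate process as long as it reads and writes the same files.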
Copy or clone this repository into your Claude Code plugins directory, or install from a target project:
# From your project directory
claude /plugin install /path/to/trine-eval
/harness-kickoff Build a task management API with user authentication and team workspaces
This will:
Create a .harness/ directory with configuration
/harness-sprint
Runs the next incomplete sprint through the full cycle:
To target a specific sprint:
/harness-sprint 3
/harness-summary
Generates a cross-sprint analysis with pass rates, trends, failure patterns, and recommendations.
The harness ships with four rubrics. Set the project type during kickoff or in .harness/config.json:
| Type | Rubric | Key Dimensions |
|---|---|---|
| web-app | web-app | Functionality, Visual Design, Code Quality, Robustness |
| rag-system | rag-system | Retrieval Quality, Answer Faithfulness, System Robustness, Architecture |
| cli-tool | cli-tool | Functionality, Usability, Error Handling, Code Quality |
| api-service | api-service | Correctness, Robustness, API Design, Code Quality |
.harness/config.json controls harness behavior:
{
  "project_type": "web-app",
  "rubric": "web-app",
  "max_retries": 3,
  "pass_threshold": {
    "per_dimension_minimum": 2,
    "critical_dimensions": ["functionality"],
    "critical_minimum": 3
  },
  "contract_negotiation_rounds": 2,
  "git_checkpoint": true,
  "components_enabled": {
    "planner": true,
    "contract_negotiation": true,
    "sprint_decomposition": true,
    "eval_summary": true
  }
}
The components_enabled section lets you simplify the harness as models improve. For example, disabling contract_negotiation skips the Evaluator's contract review — the Generator's proposed criteria are accepted directly.
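One plausible reading of the pass_threshold block in the config above is sketched here. The source does not spell out the scoring scale or the exact gating rule, so the `sprint_passes` function, the lowercase dimension keys, and the rule that a missing critical dimension counts as a failure are all assumptions made for illustration.

```python
def sprint_passes(scores: dict[str, int], threshold: dict) -> bool:
    """Check per-dimension scores against a pass_threshold block.

    Assumed rule: every scored dimension must reach per_dimension_minimum,
    and dimensions listed as critical must reach the higher critical_minimum.
    """
    floor = threshold.get("per_dimension_minimum", 2)
    critical = set(threshold.get("critical_dimensions", []))
    critical_floor = threshold.get("critical_minimum", 3)

    for dimension, score in scores.items():
        required = critical_floor if dimension in critical else floor
        if score < required:
            return False

    # Assumption: a critical dimension with no score at all is a failure.
    return critical.issubset(scores)


threshold = {
    "per_dimension_minimum": 2,
    "critical_dimensions": ["functionality"],
    "critical_minimum": 3,
}
print(sprint_passes({"functionality": 3, "code_quality": 2}, threshold))
print(sprint_passes({"functionality": 2, "code_quality": 4}, threshold))
```

Under these assumptions the first call passes and the second fails, because a strong score on a non-critical dimension does not offset a critical dimension that falls below critical_minimum.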
These optional fields extend .harness/config.json with backward-compatible defaults. A config that omits them runs exactly as in Phase 1.
thinking.profile: one of "default", "fast", or "thorough". Default: "default", which preserves Phase-1 behavior: no override is applied to the adaptive-thinking effort declared in each agent's frontmatter. The "fast" and "thorough" values are reserved for a future override dispatcher, so "default" is the only profile that takes effect today, and it leaves behavior unchanged. Each role's effort level is declared in its own frontmatter (plugins/trine-eval/agents/planner.md, plugins/trine-eval/agents/generator.md, plugins/trine-eval/agents/evaluator.md, and the harness-summary skill): medium for routine planning and implementation, high for capability evaluation, and max for contract review and cross-sprint summary analysis.
batch.enabled: boolean. Default: false. When true and a sprint has at least batch.min_criteria criteria, eval verifications are submitted as a single Anthropic Batch API call (50% discount on input/output tokens, 24-hour SLA). Batch is a cost optimization, not a latency optimization. With the default false, evaluations run synchronously as in Phase 1.
batch.min_criteria: integer. Default: 20. Sprints with fewer criteria stay synchronous even when batch.enabled is true, because the batch overhead is only worth absorbing on large suites. The criterion count compared against this threshold is the same count emitted in sprint-{NN}.tasks.json (success criteria + Should-NOT gates).
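The batch decision described by these two fields reduces to a small predicate. The sketch below assumes the dotted names map to a nested "batch" object in .harness/config.json; the `should_batch` helper is hypothetical and not part of the plugin.

```python
def should_batch(config: dict, criteria_count: int) -> bool:
    """Decide whether a sprint's eval verifications go through the batch path.

    Assumes a nested layout: {"batch": {"enabled": ..., "min_criteria": ...}}.
    Missing fields fall back to the documented defaults (false and 20).
    """
    batch = config.get("batch", {})
    enabled = bool(batch.get("enabled", False))
    min_criteria = int(batch.get("min_criteria", 20))
    # criteria_count is success criteria plus Should-NOT gates.
    return enabled and criteria_count >= min_criteria


# A config that omits the batch section stays synchronous.
print(should_batch({}, 50))
# Enabled, but the sprint is below the default threshold of 20.
print(should_batch({"batch": {"enabled": True}}, 19))
# Enabled and at the threshold.
print(should_batch({"batch": {"enabled": True}}, 20))
```

Reading both fields with defaults is what keeps an older config backward compatible in this sketch: with no batch section present, the predicate is always false.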
npx claudepluginhub ats-kinoshita-iso/trine-eval