From trine-eval
Cross-sprint analysis showing pass rates, consistency metrics, trends, and failure patterns
Slash command: /trine-eval:harness-summary
Generate a cross-sprint evaluation summary by analyzing all completed sprint evaluations.
The frontmatter declares thinking: { type: adaptive, effort: max }. Cross-sprint analysis is the highest-leverage reasoning task in the harness: a missed pattern here propagates into every subsequent recommendation, every saturation graduation, and every consistency-metric interpretation. Because saturation detection feeds the regression suite (append-only — once a wrong criterion graduates, it gates every future sprint), and because the recommendations from this skill shape what the operator does next, an analytical mistake at summary time has the largest blast radius in the system. max matches that — it is the right place to spend deeply on reasoning even though the skill runs less frequently than per-sprint evaluation.
- .harness/config.json for project context. Note config.mode (default "standard") and config.components_enabled.per_sprint_aci_review; these determine whether ACI self-optimization runs batched here or was already captured per-sprint in the eval reports.
- .harness/progress.md for sprint completion status.
- .harness/evals/ to collect evaluation results. Files named sprint-NN-rR.md contain per-round data; files named sprint-NN.md contain the final round's results only.
- .harness/contracts/ to understand what was promised vs delivered.

pass@k: The probability of at least one success in k attempts:
pass@k = 1 - (1 - p)^k
where p is the per-trial pass rate (passed criteria / total criteria for a single evaluation trial at fixed code state) and k is the number of trials for that sprint.
Use pass@k when one success is sufficient — e.g., a code generation tool where the user picks the best output from multiple runs. High pass@k with low pass^k indicates the system can succeed but does so inconsistently.
pass^k — The probability that all k trials succeed:
pass^k = p^k
Use pass^k when consistency is essential — e.g., a customer-facing agent where every interaction must succeed. At a 75% per-trial pass rate, pass^3 drops to approximately 42%.
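As a quick check of the two formulas, here is a minimal sketch. The function names are illustrative, not part of the harness, and both formulas assume the k trials are independent with the same per-trial pass rate p.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.75, 3
    print(f"pass@{k} = {pass_at_k(p, k):.1%}")   # 98.4%
    print(f"pass^{k} = {pass_pow_k(p, k):.1%}")  # 42.2%
```

With p = 0.75 and k = 3 the gap between the two values is large, which is the "capable but inconsistent" signature described above.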
How to compute from eval data (Phase 2: trial-based):
Compute pass@k and pass^k from trial files (sprint-NN-rR-tT.md), not round files (sprint-NN-rR.md).
- Within each round R, trial files are named sprint-NN-rR-tT.md (when config.trials > 1) or the single file sprint-NN-rR.md (when config.trials == 1).
- For R_final (the round whose code represents the shipped sprint), collect the per-trial pass rates from all trial files.
- Take k from config.trials for that sprint (defaulting to 1).

Trials measure consistency at a fixed code state (the Generator is not editing between trials), so p estimates the agent's true reliability. Retries, by contrast, change the code, so retry-round pass rates mix a fixed-bug signal into what should be a pure consistency measurement.
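The file selection described above could be sketched as follows. This is an illustration under stated assumptions, not the harness's own code: it treats the highest-numbered round as R_final, and the function name is hypothetical. Only the file-naming patterns come from the text.

```python
import re
from collections import defaultdict

# Matches sprint-NN-rR.md and sprint-NN-rR-tT.md; the trial suffix is optional.
EVAL_NAME = re.compile(r"^sprint-(\d+)-r(\d+)(?:-t(\d+))?\.md$")


def final_round_trial_files(filenames):
    """Map sprint number -> eval files for that sprint's highest-numbered round."""
    by_sprint = defaultdict(lambda: defaultdict(list))
    for name in filenames:
        m = EVAL_NAME.match(name)
        if m is None:
            continue  # e.g. sprint-NN.md, the final-round copy, is skipped
        sprint, rnd = int(m.group(1)), int(m.group(2))
        is_trial = m.group(3) is not None
        by_sprint[sprint][rnd].append((is_trial, name))

    result = {}
    for sprint, rounds in by_sprint.items():
        entries = rounds[max(rounds)]  # assumption: R_final is the highest round
        trial_files = sorted(n for is_trial, n in entries if is_trial)
        # Fall back to the single round file when config.trials == 1.
        result[sprint] = trial_files or sorted(n for _, n in entries)
    return result
```

The per-trial pass rates would then be read from the returned files; parsing the eval markdown itself is left out because its layout is not specified here.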
Deprecated (retry-derived) metric: Prior to the trial loop, pass@k and pass^k were computed from retry rounds — i.e., k was the number of retry rounds and p was their averaged pass rate. That formulation is deprecated because it treats a fixed bug as evidence of inconsistency, inflating pass@k and deflating pass^k. When rendering the summary for pre-Phase-2 sprints that have only round files, label the metric pass@rounds / pass^rounds (deprecated) and note in the summary that statistically valid pass@k/pass^k requires at least 2 trials per round.
First-round-pass rate remains a separate metric. It measures whether the Generator gets the implementation right before any retry feedback — a capability signal, not a consistency signal. Keep it in the per-sprint table as its own column.
These metrics reveal whether the system is reliable (high pass^k) or merely capable (high pass@k but low pass^k). A large gap between pass@k and pass^k signals non-determinism that needs investigation.
sprint-NN-r*.md files per sprint)

A criterion is saturated when it passes on the first evaluation round across 3 or more consecutive sprints. Saturated criteria track regressions but provide no improvement signal.
Identifying saturated criteria:
Action for saturated criteria: Graduate them into the regression suite at .harness/regression/regression.json, then replace them in the next sprint contract with harder variants that push the agent's capabilities. Include specific recommendations for harder replacements in the summary.
Distinguishing easy from well-implemented: A criterion that is inherently trivial (e.g., "file exists") saturates because it is easy — it should be graduated without replacement. A criterion that was previously hard but now consistently passes saturates because the implementation improved — replace it with a harder variant targeting the same capability. Check the criterion's history: if it ever failed in prior sprints, it represents genuine capability growth. If it has never failed across any sprint, it may be too easy.
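The saturation rule and the easy-versus-improved check can be sketched like this. It assumes each criterion's history is a list of first-round outcomes per sprint, oldest first, and that the streak is counted back from the most recent sprint. The function names and status labels are illustrative.

```python
def first_round_pass_streak(history):
    """Count consecutive first-round passes, ending at the most recent sprint."""
    streak = 0
    for passed in reversed(history):
        if not passed:
            break
        streak += 1
    return streak


def classify_criterion(history, threshold=3):
    """history: first-round outcomes per sprint, oldest first (True = pass)."""
    if first_round_pass_streak(history) < threshold:
        return "active"
    # An earlier failure means the criterion was once hard: capability growth.
    if not all(history):
        return "saturated (improved)"
    # Never failed in any sprint: it may simply be too easy.
    return "saturated (easy)"
```

For example, `classify_criterion([False, True, True, True])` yields "saturated (improved)", the case where a harder replacement is recommended.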
Graduation is a file-write, not a prose recommendation. For every saturated criterion identified above, append a machine-readable entry to .harness/regression/regression.json so Step 0.5 of the next sprint runs it as a regression gate. The writer logic is:
- Look up the criterion in .harness/contracts/sprint-NN.tasks.json, using the task_id as the stable lookup key.
- Copy the entry into regression.json's tasks array: task_id, criterion, grader_type, weight, is_gate, verification_command, and rubric_dimension are all preserved with the same values. Do not rename, paraphrase, or recompute.
- Add graduated_from_sprint: <NN>, where <NN> is the sprint whose eval first demonstrated saturation (typically the 3rd consecutive first-round-pass sprint). This preserves the audit trail: every regression entry traces back to the sprint that justified it.
- The tasks.json schema is the direct source of record, and regression.json extends that schema with one field. There is a single pipeline: sprint contracts → tasks.json → saturation detection → regression.json → Step 0.5 gate. Operators reading the summary should see the graduation action as the natural terminus of saturation detection, not a parallel mechanism.

Graduation is append-only. Never remove or rewrite an existing entry in regression.json. If a buggy summary run could mutate prior entries, a regression-coverage loss would be one bad run away, which is exactly the failure mode the gate exists to prevent. The summary only ever appends newly saturated criteria. If an operator needs to retire a regression criterion, they edit regression.json by hand, outside the harness.
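A minimal in-memory sketch of the append-only rule follows. The top-level tasks array and the graduated_from_sprint field come from the text; the function name, the choice to skip a task_id that is already present, and the integer sprint value are assumptions made for illustration. Reading and writing the JSON file is omitted.

```python
import copy


def graduate(regression, task, sprint):
    """Return a regression document with `task` appended and tagged with its sprint."""
    existing = regression.get("tasks", [])
    # Assumption: a task_id already in the suite is left untouched, never rewritten.
    if any(t["task_id"] == task["task_id"] for t in existing):
        return regression
    entry = copy.deepcopy(task)  # preserve every field with the same values
    entry["graduated_from_sprint"] = sprint  # the one field added on graduation
    updated = dict(regression)
    updated["tasks"] = existing + [entry]  # new list; prior entries are not mutated
    return updated
```

Building a new list rather than editing in place keeps earlier entries byte-for-byte unchanged, which is the property the append-only rule is protecting.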
Sprint contracts may declare an optional Edge Case Criteria section (see skills/sprint-contract/SKILL.md for the rationale and template). Edge case criteria test ambiguous, boundary, and adversarial inputs — empty inputs, very large inputs, concurrent requests, malformed payloads, queries with no matches. They are not weighted and not counted toward the 100% weighted score.
The summary reports their results separately as Edge Case Pass Rate: a distinct column in the per-sprint table and an aggregate value across sprints.
Why separate from the weighted score. Folding edge cases into the weighted total creates the one-sided-eval failure mode Anthropic's playbook calls out: an agent that only passes obvious positive cases earns the same headline score as one that also handles ambiguous inputs. Reporting Edge Case Pass Rate as its own metric makes that asymmetry visible — a sprint that scores 100% weighted but 30% on edge cases looks materially different from one with 100% weighted and 95% on edge cases.
How to compute. For each sprint, count edge_case_passed / edge_case_total over the criteria in the contract's ## Edge Case Criteria section (or 0/0 = N/A when the section is omitted). Aggregate across sprints by summing passes and totals separately, not by averaging per-sprint rates. Report N/A explicitly when no sprint declared edge-case criteria — the absence of edge cases is meaningful information, not a zero.
Per-rubric expectations. The metric is most meaningful for web-app, api-service, and rag-system projects whose rubrics carry well-known edge-case domains (browser viewport extremes, empty/oversized API payloads, queries with no matching documents). For cli-tool and eval-harness projects, Edge Case Pass Rate often shows N/A — those rubrics encode edge-case concerns inside the dimension scoring tables rather than as separate criteria.
Render in the per-sprint table. Add an Edge Case Pass Rate column to the per-sprint table; show N/A when the sprint declared no edge cases.
The cross-sprint edge case aggregate is a single number computed across every sprint that declared an ## Edge Case Criteria section. The formula is:
cross-sprint edge-case pass rate = total edge-case PASS / total edge-case criteria
where total edge-case PASS sums the PASS counts across every contributing sprint and total edge-case criteria sums the totals. Sprints that omit the edge-case section contribute neither to the numerator nor the denominator. When no sprint has declared edge cases, render the aggregate as N/A, matching the per-sprint convention.
Why summing rather than averaging. A sprint with 1 edge-case criterion (1/1 PASS) and a sprint with 20 edge-case criteria (10/20 PASS) average to 75% if you average per-sprint rates, but the true aggregate is 11/21 = 52%. Averaging per-sprint rates over-weights sprints that declared few edge cases. Summing passes and totals separately preserves the rate's meaning across sprint sizes — the aggregate answers "of every edge case ever evaluated, what fraction passed" rather than "what is the mean per-sprint edge-case rate."
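The difference between the two aggregation rules can be reproduced with a short sketch. This is not the tests/edge-case-aggregate.py script mentioned elsewhere in this document; the function names and the list-of-pairs input are hypothetical.

```python
def aggregate_rate(per_sprint):
    """per_sprint: (passed, total) pairs; sprints without the section are left out."""
    passed = sum(p for p, _ in per_sprint)
    total = sum(t for _, t in per_sprint)
    return None if total == 0 else passed / total  # None is rendered as N/A


def mean_of_rates(per_sprint):
    """The averaging approach the text argues against, shown only for contrast."""
    rates = [p / t for p, t in per_sprint if t]
    return None if not rates else sum(rates) / len(rates)


if __name__ == "__main__":
    sprints = [(1, 1), (10, 20)]
    print(f"summed:   {aggregate_rate(sprints):.0%}")  # 52%
    print(f"averaged: {mean_of_rates(sprints):.0%}")   # 75%
```

Returning None for an empty input keeps "no edge cases declared" distinct from a genuine 0% rate.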
Where to render. Add the aggregate as a new line beneath the per-sprint Edge Case Pass Rate column, in the Overview section: Cross-sprint edge-case pass rate: P/T = X% (or N/A). The aggregator script tests/edge-case-aggregate.py computes this value from the fixture project; the production summary computes it identically from the parent .harness/contracts/ and .harness/evals/ trees.
Sprint contracts may declare an optional Functional Smoke section (see skills/sprint-contract/SKILL.md). Functional Smoke criteria exercise the deliverable against real external systems (live Anthropic API, real Docker, real filesystem, real judge model) and inform the Functional Integration Coverage rubric dimension. They are not weighted and not counted toward the 100% weighted score.
The summary reports their results separately as Functional Smoke Pass Rate — a distinct column in the per-sprint table and an aggregate value across sprints. This is structurally identical to Edge Case Pass Rate, but answers a different question: does the code work end-to-end against real systems? rather than does the code handle ambiguous inputs?
Why separate from the weighted score. Mocked architectural tests can pass at 100% on code that fails the moment it touches a real API — wrong cache_control key shape, batch demux that doesn't match the real custom_id echo, judge prompts the real model interprets differently. Folding functional smoke into the weighted total would let a sprint claim "100% PASS" on the mocked surface while shipping broken integration. Reporting Functional Smoke Pass Rate as its own metric makes the architectural/functional gap visible, and the Functional Integration Coverage rubric dimension consumes both the pass rate and structural indicators (env-var gating, budget cap, fixture parity).
How to compute. For each sprint, count functional_smoke_passed / functional_smoke_total over the criteria in the contract's ## Functional Smoke section (or 0/0 = N/A when the section is omitted). Aggregate across sprints by summing passes and totals separately, not by averaging per-sprint rates — same rule as edge cases. Report N/A explicitly when no sprint declared functional smoke criteria.
Per-rubric expectations. Most meaningful for projects whose deliverables integrate with paid or external systems (api-service, anything using anthropic, anything using Docker). For pure documentation, pure refactor, or pure-internal sprints, N/A is the expected reading.
Render in the per-sprint table. Add a Functional Smoke Pass Rate column to the per-sprint table; show N/A when the sprint declared no functional smoke criteria. The column sits to the right of Edge Case Pass Rate — same column class (optional, not weighted).
cross-sprint functional smoke pass rate = total smoke PASS / total smoke criteria
Sum passes and totals separately across every sprint that declared ## Functional Smoke. Sprints that omit the section contribute neither to numerator nor denominator. Render as N/A when no sprint has declared functional smoke. Place the aggregate beneath the per-sprint column in the Overview section: Cross-sprint functional smoke pass rate: P/T = X% (or N/A). The summing rule is the same one defended at length under Edge Case Pass Rate above — averaging per-sprint rates would over-weight sprints with few smoke criteria.
Cost reporting. Alongside the pass rate, surface the per-sprint and cross-sprint live-API spend (USD). Source the cost from the runner's EvalLog.metadata — each smoke run records its actual cost_usd via the existing Opus 4.7 pricing path in src/trine_eval/runner/engine.py. A sprint that comes in at $0.94 (under the $1.00 cap) is fine; a sprint that hits the cap should be flagged in ## Recommendations as a candidate for fixture conversion.
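A minimal sketch of the cost roll-up, assuming per-sprint spend has already been summed from each smoke run's cost_usd into a plain dict keyed by sprint number. The dict shape and function name are hypothetical and not the runner's API, and "hits the cap" is read here as spend at or above the cap.

```python
def cost_report(costs_by_sprint, cap_usd=1.00):
    """Return (cross-sprint total in USD, sprints to flag for fixture conversion)."""
    total = sum(costs_by_sprint.values())
    # Assumption: a sprint "hits the cap" when its spend is at or above cap_usd.
    flagged = sorted(s for s, cost in costs_by_sprint.items() if cost >= cap_usd)
    return total, flagged
```

For instance, `cost_report({3: 0.94, 4: 1.00})` flags only sprint 4, matching the $0.94 example above.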
components_enabled config)

When rendering a FAIL criterion entry or a grader-disagreement entry in the summary output, link the corresponding structured transcript at .harness/transcripts/sprint-NN-rR-tT.json (multi-trial) or .harness/transcripts/sprint-NN-rR.json (single-trial). The transcript pairs 1:1 with the eval markdown and contains the structured record described in rules/harness-conventions.md under Transcript Schema: messages, tool_calls, token_usage, timing, and thinking_summary for that run.
Why FAIL and disagreement entries specifically — and not every PASS row. PASS verdicts on behavioral or structural criteria are usually self-explanatory: the verification command exited 0, and that exit code is the answer. A human auditor scanning the summary does not typically need the transcript for those rows. FAIL entries and grader disagreements are the points where the verdict alone is insufficient — the auditor needs to see what tools the Evaluator called, what evidence it weighed, and how it got from observation to conclusion. Linking transcripts only at those entries keeps the summary readable while preserving audit access exactly where it matters; linking every row would dilute the signal and turn the summary into a wall of paths.
How to render the link. Add a Transcript: line under the FAIL criterion's evidence (or the grader-disagreement entry), pointing to the transcript file with a relative path. Example for a multi-trial FAIL:
### Sprint 4, Criterion 8 — FAIL
**Evidence:** PostToolUse hook only echoed; did not update `sprint-state.json`.
**Transcript:** `.harness/transcripts/sprint-04-r1-t1.json`
For single-trial mode, the transcript path is .harness/transcripts/sprint-NN-rR.json. If the transcript file does not exist (the evaluator did not emit a parseable trailer for that run), omit the line — do NOT print a broken link. Transcripts are append-only-when-available, so missing transcripts are a known and acceptable state.
Grader-disagreement entries include any criterion where a code-based grader and the LLM-judge would have produced different verdicts (per the Evaluator's ## Human Review Flags section in the eval report). These are the calibration touch-points where the structured transcript matters most — the FAIL/PASS gap reflects rubric ambiguity, and the transcript captures the reasoning that produced each verdict.
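The linking rule (pick the path by trial mode, omit the line when the file is missing) might look like the sketch below. The function name and arguments are hypothetical, and the two-digit zero padding of NN is inferred from the sprint-04 example rather than stated by the harness.

```python
from pathlib import Path


def transcript_line(project_root, sprint, rnd, trial=None):
    """Return the markdown Transcript line, or None when the file is missing."""
    name = f"sprint-{sprint:02d}-r{rnd}"  # assumption: NN is zero-padded to 2 digits
    if trial is not None:
        name += f"-t{trial}"  # multi-trial runs carry a -tT suffix
    rel = f".harness/transcripts/{name}.json"
    if not (Path(project_root) / rel).is_file():
        return None  # omit the line rather than print a broken link
    return f"**Transcript:** `{rel}`"
```

A caller would append the returned line under the FAIL evidence only when it is not None.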
Write the summary to .harness/summary.md:
# Eval Summary
**Mode:** {config.mode} <!-- "standard" or "minimal"; omit the line if the field is absent for backward compat -->
## Overview
- Sprints completed: X
- Overall pass rate: Y%
- Overall weighted pass rate: W%
- Average rounds per sprint: Z
## Consistency Metrics
| Sprint | p (avg) | k (trials) | pass@k | pass^k |
|--------|---------|------------|--------|--------|
| 1 | 0.85 | 2 | 97.8% | 72.3% |
- Overall pass@k: {value}
- Overall pass^k: {value}
- Consistency gap (pass@k - pass^k): {value} — {interpretation}
## Per-Sprint Results
| Sprint | Title | Verdict | Rounds | Pass Rate | Weighted | pass^k | Edge Case Pass Rate | Functional Smoke Pass Rate |
|--------|-------|---------|--------|-----------|----------|--------|---------------------|----------------------------|
| 1 | ... | PASS | 2 | 85% | 87% | 72% | N/A | N/A |
## Trend Analysis
{Description of trends including consistency trends}
## Common Failure Patterns
{Ranked list of recurring issues}
## Saturation & Regression Graduation
| Criterion Type | Consecutive First-Round Passes | Status | Recommendation |
|---------------|-------------------------------|--------|----------------|
| File exists | 5 | Saturated (easy) | Graduate without replacement |
| Error handling| 3 | Saturated (improved) | Replace with concurrency edge cases |
## Recommendations
{Actionable suggestions for next sprints, including consistency improvements and graduation actions}
## Tool & Skill Description Improvements
{ACI self-optimization recommendations from eval transcript analysis — see below}
Also print the summary to the user for immediate review.
After generating the summary, review eval transcripts to identify improvements to tool and skill descriptions. This implements the playbook's guidance that "tool design is an eval target itself" and that agents optimizing tool descriptions can produce improvements "beyond expert human-written implementations."
Mode handling:
- If config.components_enabled.per_sprint_aci_review is true (standard mode default): each eval report has already been reviewed for grader-quality issues by the Evaluator itself. Here, surface those per-sprint findings: pull the Transcript Review observations from each sprint-NN-rR.md and consolidate them into the summary's "Tool & Skill Description Improvements" section.
- If config.components_enabled.per_sprint_aci_review is false (minimal mode default): per-sprint Transcript Review was skipped to save tokens. Perform a single batched review across all .harness/evals/*.md files here instead. This is cheaper than per-sprint review because repeated patterns are only flagged once.

The extraction process below applies in both cases; batched mode just processes all evals in one pass.
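The mode handling above can be sketched as a small resolver over the parsed config. The function name and the "consolidate" and "batched" return labels are illustrative. The defaults follow the text: mode falls back to "standard", and the flag defaults to true in standard mode and false in minimal mode.

```python
def aci_review_mode(config):
    """Decide whether to consolidate per-sprint findings or run one batched review."""
    mode = config.get("mode", "standard")
    components = config.get("components_enabled", {})
    # An explicit flag wins; otherwise fall back to the default for the mode.
    per_sprint = components.get("per_sprint_aci_review", mode == "standard")
    return "consolidate" if per_sprint else "batched"
```

An explicit per_sprint_aci_review value overrides the mode default, so a minimal-mode project can still opt in to per-sprint review.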
Read through eval reports (.harness/evals/sprint-NN-rR.md) looking for:
- A criterion tagged behavioral whose evidence was actually file-reading (or one tagged structural that secretly demanded execution, or one tagged either that genuinely required LLM judgment) suggests the verification method in the contract was misspecified. Mislabeled behavioral criteria are the most common failure mode: they look strong on paper but accept structural evidence at grading time.

For each identified issue:
To ensure improvements actually help rather than introducing new problems:
Only apply changes that improve held-out cases without causing regressions.
npx claudepluginhub ats-kinoshita-iso/trine-eval