Skill

harness-summary

From trine-eval

Cross-sprint analysis showing pass rates, consistency metrics, trends, and failure patterns

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/trine-eval:harness-summary

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Glob, Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Generate a cross-sprint evaluation summary by analyzing all completed sprint evaluations.

SKILL.md

268 lines · ~6.2k tokens (exceeds 5k compaction limit)

Stats

Language: Python

Parent stars: 0

Maintenance: Excellent

Last Commit: Jun 8, 2026

Eval Summary

Generate a cross-sprint evaluation summary by analyzing all completed sprint evaluations.

Thinking Effort

The frontmatter declares thinking: { type: adaptive, effort: max }. Cross-sprint analysis is the highest-leverage reasoning task in the harness: a missed pattern here propagates into every subsequent recommendation, every saturation graduation, and every consistency-metric interpretation. Because saturation detection feeds the regression suite (append-only — once a wrong criterion graduates, it gates every future sprint), and because the recommendations from this skill shape what the operator does next, an analytical mistake at summary time has the largest blast radius in the system. max matches that — it is the right place to spend deeply on reasoning even though the skill runs less frequently than per-sprint evaluation.

How to Generate

Read .harness/config.json for project context. Note config.mode (default "standard") and config.components_enabled.per_sprint_aci_review — these determine whether ACI self-optimization runs batched here or was already captured per-sprint in the eval reports.
Read .harness/progress.md for sprint completion status
Read all files in .harness/evals/ to collect evaluation results. Files named sprint-NN-rR.md contain per-round data; files named sprint-NN.md contain the final round's results only.
Read all files in .harness/contracts/ to understand what was promised vs delivered
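
The first step can be sketched as a small loader. This is a minimal illustration, not the harness's own code: the function name `load_harness_settings` is hypothetical, and falling back to the mode's documented default when `per_sprint_aci_review` is absent is an assumption.

```python
import json
from pathlib import Path


def load_harness_settings(harness_dir: str = ".harness") -> dict:
    """Return the two config values the summary branches on."""
    config_path = Path(harness_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    # config.mode defaults to "standard" when the field is absent.
    mode = config.get("mode", "standard")

    # Assumption: a missing flag falls back to the mode's documented default
    # (true in standard mode, false in minimal mode).
    components = config.get("components_enabled", {})
    per_sprint = components.get("per_sprint_aci_review", mode == "standard")

    return {"mode": mode, "per_sprint_aci_review": bool(per_sprint)}
```

The returned flag decides later whether ACI review findings are consolidated from existing eval reports or produced in one batched pass.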

What to Compute

Pass Rate

Overall: (total passed criteria) / (total criteria) across all sprints
Per-sprint: pass rate for each individual sprint
Weighted pass rate: per sprint, the sum of the weights of passed criteria divided by the 100% total weight (only when contracts use weighted criteria)
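
A short sketch of the three pass-rate figures, assuming the criterion counts and weights have already been extracted from the eval reports. The sprint counts and the helper names are illustrative, not taken from the harness.

```python
from typing import Optional


def pass_rate(passed: int, total: int) -> Optional[float]:
    """Unweighted pass rate, or None when there is nothing to count."""
    return passed / total if total else None


def weighted_pass_rate(criteria: list[tuple[float, bool]]) -> float:
    """Sum the weights (in percent) of passed criteria; weights total 100."""
    return sum(weight for weight, passed in criteria if passed) / 100.0


# Hypothetical (passed, total) counts for two sprints.
sprints = {1: (17, 20), 2: (9, 10)}

per_sprint = {n: pass_rate(p, t) for n, (p, t) in sprints.items()}

# Overall rate pools the counts across sprints rather than averaging rates.
overall = pass_rate(
    sum(p for p, _ in sprints.values()),
    sum(t for _, t in sprints.values()),
)
```

With these hypothetical counts the per-sprint rates are 85% and 90%, while the overall rate is 26/30, roughly 86.7%.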

Consistency Metrics

pass@k — The probability of at least one success in k attempts:

pass@k = 1 - (1 - p)^k

where p is the per-trial pass rate (passed criteria / total criteria for a single evaluation trial at fixed code state) and k is the number of trials for that sprint.

Use pass@k when one success is sufficient — e.g., a code generation tool where the user picks the best output from multiple runs. High pass@k with low pass^k indicates the system can succeed but does so inconsistently.

pass^k — The probability that all k trials succeed:

pass^k = p^k

Use pass^k when consistency is essential — e.g., a customer-facing agent where every interaction must succeed. At a 75% per-trial pass rate, pass^3 drops to approximately 42%.
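
Both formulas are one-liners. The sketch below simply evaluates them for the 75% example above; the function names are chosen here for illustration.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


p, k = 0.75, 3
at_k = pass_at_k(p, k)    # 1 - 0.25**3 = 0.984375
pow_k = pass_pow_k(p, k)  # 0.75**3   = 0.421875
```

At the same per-trial rate the two metrics diverge sharply: about 98% for pass@3 against about 42% for pass^3.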

How to compute from eval data (Phase 2: trial-based):

Compute pass@k and pass^k from trial files (sprint-NN-rR-tT.md), not round files (sprint-NN-rR.md).

For each sprint, group eval files by round R. Within each round, trial files are named sprint-NN-rR-tT.md (when config.trials > 1) or the single file sprint-NN-rR.md (when config.trials == 1).
For the most recent round R_final (the round whose code represents the shipped sprint), collect the per-trial pass rates from all trial files.
Compute p as the average per-trial pass rate across those trials. If only one trial exists (single-trial mode), p is simply the round's pass rate and pass@1 = pass^1 = p.
k = config.trials for that sprint (defaulting to 1).
Report both pass@k and pass^k per sprint and overall.
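
The grouping steps above can be sketched as follows. This is an illustration under stated assumptions: the per-trial pass rates are taken as already parsed from the eval files, and k is derived from the number of trial files found rather than read from config.trials. The helper names are hypothetical.

```python
import re
from collections import defaultdict

# Matches sprint-NN-rR.md and sprint-NN-rR-tT.md; sprint-NN.md is skipped.
EVAL_NAME = re.compile(r"^sprint-(\d+)-r(\d+)(?:-t(\d+))?\.md$")


def final_round_files(filenames):
    """Map sprint number -> eval files of its most recent round."""
    by_sprint = defaultdict(lambda: defaultdict(list))
    for name in filenames:
        match = EVAL_NAME.match(name)
        if match is None:
            continue
        sprint, round_no = int(match.group(1)), int(match.group(2))
        is_trial = match.group(3) is not None
        by_sprint[sprint][round_no].append((is_trial, name))

    result = {}
    for sprint, rounds in by_sprint.items():
        entries = rounds[max(rounds)]  # R_final
        trial_files = [name for is_trial, name in entries if is_trial]
        # Prefer trial files; fall back to the single round file.
        result[sprint] = sorted(trial_files or [name for _, name in entries])
    return result


def sprint_consistency(trial_pass_rates):
    """Return (p, k, pass@k, pass^k) for one sprint's final round."""
    k = len(trial_pass_rates)  # assumption: stands in for config.trials
    p = sum(trial_pass_rates) / k
    return p, k, 1.0 - (1.0 - p) ** k, p ** k
```

For example, two trials with pass rates 0.8 and 0.9 give p = 0.85 and k = 2, so pass@k is about 97.8% and pass^k about 72.3%.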

Trials measure consistency at a fixed code state (the Generator is not editing between trials), so p estimates the agent's true reliability. Retries, by contrast, change the code, so retry-round pass rates mix a fixed-bug signal into what should be a pure consistency measurement.

Deprecated (retry-derived) metric: Prior to the trial loop, pass@k and pass^k were computed from retry rounds — i.e., k was the number of retry rounds and p was their averaged pass rate. That formulation is deprecated because it treats a fixed bug as evidence of inconsistency, inflating pass@k and deflating pass^k. When rendering the summary for pre-Phase-2 sprints that have only round files, label the metric pass@rounds / pass^rounds (deprecated) and note in the summary that statistically valid pass@k/pass^k requires at least 2 trials per round.

First-round-pass rate remains a separate metric. It measures whether the Generator gets the implementation right before any retry feedback — a capability signal, not a consistency signal. Keep it in the per-sprint table as its own column.

These metrics reveal whether the system is reliable (high pass^k) or merely capable (high pass@k but low pass^k). A large gap between pass@k and pass^k signals non-determinism that needs investigation.

Trend Analysis

Is the pass rate improving or degrading across sprints?
Are retry counts increasing or decreasing?
Is the first-round pass rate improving? (This indicates the Generator is learning from prior eval feedback)
Is pass^k improving? (This indicates the system is becoming more consistent, not just more capable)

Failure Patterns

Which rubric dimensions fail most often?
Are the same types of issues recurring? (e.g., always failing on error handling, always failing on responsive design)
Which criteria required the most retries?

Retry Efficiency

Average rounds per sprint (count sprint-NN-r*.md files per sprint)
Cost trajectory: are later rounds cheaper than earlier ones? Compare criteria fail counts across rounds within each sprint. (If not improving, feedback specificity may need improvement)
First-round vs final-round delta: how many criteria were fixed by retries?
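
A minimal sketch of the round counting and the first-to-final delta. It counts distinct round numbers per sprint so that multiple trial files in one round are not counted as extra rounds; that deduplication, and the helper names, are assumptions rather than documented harness behavior.

```python
import re

ROUND_NAME = re.compile(r"^sprint-(\d+)-r(\d+)(?:-t\d+)?\.md$")


def rounds_per_sprint(filenames):
    """Map sprint number -> number of distinct evaluation rounds."""
    seen = {}
    for name in filenames:
        match = ROUND_NAME.match(name)
        if match:
            sprint, round_no = int(match.group(1)), int(match.group(2))
            seen.setdefault(sprint, set()).add(round_no)
    return {sprint: len(rounds) for sprint, rounds in seen.items()}


def average_rounds(filenames):
    """Average rounds per sprint, or None when no round files exist."""
    counts = rounds_per_sprint(filenames)
    return sum(counts.values()) / len(counts) if counts else None


def fixed_by_retries(fail_counts_by_round):
    """Criteria failing in round 1 minus criteria failing in the final round."""
    return fail_counts_by_round[0] - fail_counts_by_round[-1]
```

A fail-count sequence such as `[4, 1, 0]` also shows the cost trajectory directly: each retry round fails fewer criteria than the one before.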

Saturation & Regression Graduation

A criterion is saturated when it passes on the first evaluation round across 3 or more consecutive sprints. Saturated criteria track regressions but provide no improvement signal.

Identifying saturated criteria:

For each criterion type that appears across sprints, check whether it passed in round 1 of each sprint in which it appears
If it has passed on first attempt for 3+ consecutive sprints, flag it as saturated

Action for saturated criteria: Graduate them into the regression suite at .harness/regression/regression.json, then replace them in the next sprint contract with harder variants that push the agent's capabilities. Include specific recommendations for harder replacements in the summary.

Distinguishing easy from well-implemented: A criterion that is inherently trivial (e.g., "file exists") saturates because it is easy — it should be graduated without replacement. A criterion that was previously hard but now consistently passes saturates because the implementation improved — replace it with a harder variant targeting the same capability. Check the criterion's history: if it ever failed in prior sprints, it represents genuine capability growth. If it has never failed across any sprint, it may be too easy.
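
The detection and the easy-versus-improved split can be sketched as below. Two assumptions are made for illustration: the streak is measured over the most recent sprints, and a first-round failure anywhere in the history counts as "ever failed". The function names are hypothetical.

```python
def first_round_streak(history):
    """Length of the trailing run of first-round passes.

    history: first-round results ordered oldest to newest (True = passed).
    """
    streak = 0
    for passed in reversed(history):
        if not passed:
            break
        streak += 1
    return streak


def classify(history, threshold=3):
    """Label a criterion using the statuses from the summary table."""
    if first_round_streak(history) < threshold:
        return "Not saturated"
    # Never failed at all: likely trivial. Failed earlier: capability grew.
    return "Saturated (easy)" if all(history) else "Saturated (improved)"
```

A history of `[False, True, True, True]` is classified as improved and should get a harder replacement, while five straight passes are classified as easy and graduate without one.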

Graduation is a file-write, not a prose recommendation. For every saturated criterion identified above, append a machine-readable entry to .harness/regression/regression.json so Step 0.5 of the next sprint runs it as a regression gate. The writer logic is:

Locate the source entry in the producing sprint's .harness/contracts/sprint-NN.tasks.json — use the task_id as the stable lookup key.
Copy that entry verbatim into regression.json's tasks array: task_id, criterion, grader_type, weight, is_gate, verification_command, and rubric_dimension are all preserved with the same values. Do not rename, paraphrase, or recompute.
Add one new field to the copied entry: graduated_from_sprint: <NN>, where <NN> is the sprint whose eval first demonstrated saturation (typically the 3rd consecutive first-round-pass sprint). This preserves the audit trail — every regression entry traces back to the sprint that justified it.
Positioning: regression is the downstream product of the same saturation detection documented above — not a new hand-curated list. The Sprint 6 tasks.json schema is the direct source of record, and regression.json extends that schema with one field. There is a single pipeline: sprint contracts → tasks.json → saturation detection → regression.json → Step 0.5 gate. Operators reading the summary should see the graduation action as the natural terminus of saturation detection, not a parallel mechanism.

Graduation is append-only. Never remove or rewrite an existing entry in regression.json. If a buggy summary run could mutate prior entries, a regression-coverage loss would be one bad run away — exactly the failure mode the gate exists to prevent. The summary only ever appends newly saturated criteria. If an operator needs to retire a regression criterion, they edit regression.json by hand, outside the harness.
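
The append-only rule can be illustrated with a pure function over the parsed JSON. This is a sketch only: skipping a task_id that is already present and storing the sprint as an integer are assumptions, and `graduate` is a hypothetical name.

```python
import copy


def graduate(regression: dict, source_task: dict, sprint: int) -> dict:
    """Return a new regression document with source_task appended.

    Existing entries are carried over untouched and in order.
    """
    tasks = list(regression.get("tasks", []))

    # Assumption: an already graduated task_id is left as-is, never rewritten.
    if any(t.get("task_id") == source_task["task_id"] for t in tasks):
        return {**regression, "tasks": tasks}

    entry = copy.deepcopy(source_task)       # verbatim copy of every field
    entry["graduated_from_sprint"] = sprint  # the one added field
    tasks.append(entry)
    return {**regression, "tasks": tasks}
```

Because the function only ever extends the list, a second run over the same saturated criterion leaves regression.json unchanged.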

Edge Case Pass Rate

Sprint contracts may declare an optional Edge Case Criteria section (see skills/sprint-contract/SKILL.md for the rationale and template). Edge case criteria test ambiguous, boundary, and adversarial inputs — empty inputs, very large inputs, concurrent requests, malformed payloads, queries with no matches. They are not weighted and not counted toward the 100% weighted score.

The summary reports their results separately as Edge Case Pass Rate: a distinct column in the per-sprint table and an aggregate value across sprints.

Why separate from the weighted score. Folding edge cases into the weighted total creates the one-sided-eval failure mode Anthropic's playbook calls out: an agent that only passes obvious positive cases earns the same headline score as one that also handles ambiguous inputs. Reporting Edge Case Pass Rate as its own metric makes that asymmetry visible — a sprint that scores 100% weighted but 30% on edge cases looks materially different from one with 100% weighted and 95% on edge cases.

How to compute. For each sprint, count edge_case_passed / edge_case_total over the criteria in the contract's ## Edge Case Criteria section (or 0/0 = N/A when the section is omitted). Aggregate across sprints by summing passes and totals separately, not by averaging per-sprint rates. Report N/A explicitly when no sprint declared edge-case criteria — the absence of edge cases is meaningful information, not a zero.

Per-rubric expectations. The metric is most meaningful for web-app, api-service, and rag-system projects whose rubrics carry well-known edge-case domains (browser viewport extremes, empty/oversized API payloads, queries with no matching documents). For cli-tool and eval-harness projects, Edge Case Pass Rate often shows N/A — those rubrics encode edge-case concerns inside the dimension scoring tables rather than as separate criteria.

Render in the per-sprint table. Add an Edge Case Pass Rate column to the per-sprint table; show N/A when the sprint declared no edge cases.

Cross-sprint edge case aggregation

The cross-sprint edge case aggregate is a single number computed across every sprint that declared an ## Edge Case Criteria section. The formula is:

cross-sprint edge-case pass rate = total edge-case PASS / total edge-case criteria

where total edge-case PASS sums the PASS counts across every contributing sprint and total edge-case criteria sums the totals. Sprints that omit the edge-case section contribute neither to the numerator nor the denominator. When no sprint has declared edge cases, render the aggregate as N/A, matching the per-sprint convention.

Why summing rather than averaging. A sprint with 1 edge-case criterion (1/1 PASS) and a sprint with 20 edge-case criteria (10/20 PASS) average to 75% if you average per-sprint rates, but the true aggregate is 11/21 = 52%. Averaging per-sprint rates over-weights sprints that declared few edge cases. Summing passes and totals separately preserves the rate's meaning across sprint sizes — the aggregate answers "of every edge case ever evaluated, what fraction passed" rather than "what is the mean per-sprint edge-case rate."
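
The difference between the two aggregation rules is easy to check numerically. The sketch below reproduces the 1/1 and 10/20 example; it is not the logic of tests/edge-case-aggregate.py, and the helper names are illustrative.

```python
from typing import Optional

# (passed, total) per sprint, or None when the section is omitted.
Counts = Optional[tuple[int, int]]


def cross_sprint_totals(per_sprint: list[Counts]) -> Optional[tuple[int, int]]:
    """Sum passes and totals over sprints that declared criteria."""
    declared = [c for c in per_sprint if c is not None and c[1] > 0]
    if not declared:
        return None
    return sum(p for p, _ in declared), sum(t for _, t in declared)


def render_aggregate(per_sprint: list[Counts]) -> str:
    """Format the Overview line, using N/A when nothing was declared."""
    totals = cross_sprint_totals(per_sprint)
    if totals is None:
        return "Cross-sprint edge-case pass rate: N/A"
    passed, total = totals
    return f"Cross-sprint edge-case pass rate: {passed}/{total} = {passed / total:.0%}"


example = [(1, 1), (10, 20), None]
summed = render_aggregate(example)
# The rejected alternative: averaging per-sprint rates.
averaged = sum(p / t for p, t in [(1, 1), (10, 20)]) / 2
```

Here `summed` reads 11/21 = 52%, while `averaged` is 0.75, the inflated figure the text warns against.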

Where to render. Add the aggregate as a new line beneath the per-sprint Edge Case Pass Rate column, in the Overview section: Cross-sprint edge-case pass rate: P/T = X% (or N/A). The aggregator script tests/edge-case-aggregate.py computes this value from the fixture project; the production summary computes it identically from the parent .harness/contracts/ and .harness/evals/ trees.

Functional Smoke Pass Rate

Sprint contracts may declare an optional Functional Smoke section (see skills/sprint-contract/SKILL.md). Functional Smoke criteria exercise the deliverable against real external systems (live Anthropic API, real Docker, real filesystem, real judge model) and inform the Functional Integration Coverage rubric dimension. They are not weighted and not counted toward the 100% weighted score.

The summary reports their results separately as Functional Smoke Pass Rate — a distinct column in the per-sprint table and an aggregate value across sprints. This is structurally identical to Edge Case Pass Rate, but answers a different question: does the code work end-to-end against real systems? rather than does the code handle ambiguous inputs?

Why separate from the weighted score. Mocked architectural tests can pass at 100% on code that fails the moment it touches a real API — wrong cache_control key shape, batch demux that doesn't match the real custom_id echo, judge prompts the real model interprets differently. Folding functional smoke into the weighted total would let a sprint claim "100% PASS" on the mocked surface while shipping broken integration. Reporting Functional Smoke Pass Rate as its own metric makes the architectural/functional gap visible, and the Functional Integration Coverage rubric dimension consumes both the pass rate and structural indicators (env-var gating, budget cap, fixture parity).

How to compute. For each sprint, count functional_smoke_passed / functional_smoke_total over the criteria in the contract's ## Functional Smoke section (or 0/0 = N/A when the section is omitted). Aggregate across sprints by summing passes and totals separately, not by averaging per-sprint rates — same rule as edge cases. Report N/A explicitly when no sprint declared functional smoke criteria.

Per-rubric expectations. Most meaningful for projects whose deliverables integrate with paid or external systems (api-service, anything using anthropic, anything using Docker). For pure documentation, pure refactor, or pure-internal sprints, N/A is the expected reading.

Render in the per-sprint table. Add a Functional Smoke Pass Rate column to the per-sprint table; show N/A when the sprint declared no functional smoke criteria. The column sits to the right of Edge Case Pass Rate — same column class (optional, not weighted).

Cross-sprint functional smoke aggregation

cross-sprint functional smoke pass rate = total smoke PASS / total smoke criteria

Sum passes and totals separately across every sprint that declared ## Functional Smoke. Sprints that omit the section contribute neither to numerator nor denominator. Render as N/A when no sprint has declared functional smoke. Place the aggregate beneath the per-sprint column in the Overview section: Cross-sprint functional smoke pass rate: P/T = X% (or N/A). The summing rule is the same one defended at length under Edge Case Pass Rate above — averaging per-sprint rates would over-weight sprints with few smoke criteria.

Cost reporting. Alongside the pass rate, surface the per-sprint and cross-sprint live-API spend (USD). Source the cost from the runner's EvalLog.metadata — each smoke run records its actual cost_usd via the existing Opus 4.7 pricing path in src/trine_eval/runner/engine.py. A sprint that comes in at $0.94 (under the $1.00 cap) is fine; a sprint that hits the cap should be flagged in ## Recommendations as a candidate for fixture conversion.
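
A rough sketch of the cost roll-up, assuming the per-sprint cost_usd values have already been read out of the runner's metadata into a plain dict. The dict shape and the `cost_report` name are hypothetical; the $1.00 cap is the one mentioned above.

```python
CAP_USD = 1.00  # per-sprint cap from the text


def cost_report(cost_by_sprint: dict, cap: float = CAP_USD) -> dict:
    """Total the live-API spend and list sprints that reached the cap."""
    total = sum(cost_by_sprint.values())
    flagged = sorted(s for s, cost in cost_by_sprint.items() if cost >= cap)
    return {"total_usd": round(total, 2), "flagged_sprints": flagged}


report = cost_report({4: 0.94, 5: 1.00})
```

With these hypothetical figures sprint 4 passes unflagged at $0.94, and sprint 5 is listed as a candidate for fixture conversion.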

Recommendations

Based on patterns, what should the next sprint focus on?
Are there systemic issues that rubric changes could address?
Should any harness components be disabled based on performance? (per the components_enabled config)
Which criteria should be graduated from capability eval to regression suite?
Where is the largest gap between pass@k and pass^k? (indicates where to invest in consistency)
Is Edge Case Pass Rate consistently below the weighted score? (indicates one-sided optimization — the agent passes obvious cases but skids on ambiguous ones)
Is Functional Smoke Pass Rate consistently below the weighted score, or is the Functional Integration Coverage rubric stuck at 2/5 or below? (indicates architectural-only optimization — the agent passes mocked tests but its code may not work against real systems)

Transcript Links for FAIL Criteria and Grader Disagreements

When rendering a FAIL criterion entry or a grader-disagreement entry in the summary output, link the corresponding structured transcript at .harness/transcripts/sprint-NN-rR-tT.json (multi-trial) or .harness/transcripts/sprint-NN-rR.json (single-trial). The transcript pairs 1:1 with the eval markdown and contains the structured record described in rules/harness-conventions.md under Transcript Schema — messages, tool_calls, token_usage, timing, and thinking_summary for that run.

Why FAIL and disagreement entries specifically — and not every PASS row. PASS verdicts on behavioral or structural criteria are usually self-explanatory: the verification command exited 0, and that exit code is the answer. A human auditor scanning the summary does not typically need the transcript for those rows. FAIL entries and grader disagreements are the points where the verdict alone is insufficient — the auditor needs to see what tools the Evaluator called, what evidence it weighed, and how it got from observation to conclusion. Linking transcripts only at those entries keeps the summary readable while preserving audit access exactly where it matters; linking every row would dilute the signal and turn the summary into a wall of paths.

How to render the link. Add a Transcript: line under the FAIL criterion's evidence (or the grader-disagreement entry), pointing to the transcript file with a relative path. Example for a multi-trial FAIL:

### Sprint 4, Criterion 8 — FAIL
**Evidence:** PostToolUse hook only echoed; did not update `sprint-state.json`.
**Transcript:** `.harness/transcripts/sprint-04-r1-t1.json`

For single-trial mode, the transcript path is .harness/transcripts/sprint-NN-rR.json. If the transcript file does not exist (the evaluator did not emit a parseable trailer for that run), omit the line — do NOT print a broken link. Transcripts are append-only-when-available, so missing transcripts are a known and acceptable state.
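
The omit-when-missing rule can be sketched as a helper that returns nothing when the file is absent. The two-digit sprint padding follows the example path above; the function name and the `project_root` parameter are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional


def transcript_line(project_root, sprint: int, round_no: int,
                    trial: Optional[int] = None) -> Optional[str]:
    """Return the Transcript: line, or None when the file does not exist."""
    stem = f"sprint-{sprint:02d}-r{round_no}"
    if trial is not None:  # multi-trial mode adds the -tT suffix
        stem += f"-t{trial}"

    relative = f".harness/transcripts/{stem}.json"
    if not (Path(project_root) / relative).is_file():
        return None  # omit the line rather than print a broken link
    return f"**Transcript:** `{relative}`"
```

The caller appends the returned line under the FAIL evidence only when it is not None.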

Grader-disagreement entries include any criterion where a code-based grader and the LLM-judge would have produced different verdicts (per the Evaluator's ## Human Review Flags section in the eval report). These are the calibration touch-points where the structured transcript matters most — the FAIL/PASS gap reflects rubric ambiguity, and the transcript captures the reasoning that produced each verdict.

Output Format

Write the summary to .harness/summary.md:

# Eval Summary

**Mode:** {config.mode}  <!-- "standard" or "minimal"; omit the line if the field is absent for backward compat -->

## Overview
- Sprints completed: X
- Overall pass rate: Y%
- Overall weighted pass rate: W%
- Average rounds per sprint: Z

## Consistency Metrics
| Sprint | p (avg) | k (trials) | pass@k | pass^k |
|--------|---------|------------|--------|--------|
| 1      | 0.85    | 2          | 97.8%  | 72.3%  |

- Overall pass@k: {value}
- Overall pass^k: {value}
- Consistency gap (pass@k - pass^k): {value} — {interpretation}

## Per-Sprint Results
| Sprint | Title | Verdict | Rounds | Pass Rate | Weighted | pass^k | Edge Case Pass Rate | Functional Smoke Pass Rate |
|--------|-------|---------|--------|-----------|----------|--------|---------------------|----------------------------|
| 1      | ...   | PASS    | 2      | 85%       | 87%      | 72%    | N/A                 | N/A                        |

## Trend Analysis
{Description of trends including consistency trends}

## Common Failure Patterns
{Ranked list of recurring issues}

## Saturation & Regression Graduation
| Criterion Type | Consecutive First-Round Passes | Status | Recommendation |
|---------------|-------------------------------|--------|----------------|
| File exists   | 5                             | Saturated (easy) | Graduate without replacement |
| Error handling| 3                             | Saturated (improved) | Replace with concurrency edge cases |

## Recommendations
{Actionable suggestions for next sprints, including consistency improvements and graduation actions}

## Tool & Skill Description Improvements
{ACI self-optimization recommendations from eval transcript analysis — see below}

Also print the summary to the user for immediate review.

ACI Self-Optimization from Eval Transcripts

After generating the summary, review eval transcripts to identify improvements to tool and skill descriptions. This implements the playbook's guidance that "tool design is an eval target itself" and that agents optimizing tool descriptions can produce improvements "beyond expert human-written implementations."

Mode handling:

When config.components_enabled.per_sprint_aci_review is true (standard mode default): each eval report has already been reviewed for grader-quality issues by the Evaluator itself. Here, surface those per-sprint findings — pull the Transcript Review observations from each sprint-NN-rR.md and consolidate into the summary's "Tool & Skill Description Improvements" section.
When config.components_enabled.per_sprint_aci_review is false (minimal mode default): per-sprint Transcript Review was skipped to save tokens. Perform a single batched review across all .harness/evals/*.md files here instead. This is cheaper than per-sprint review because repeated patterns are only flagged once.

The extraction process below applies in both cases; batched mode just processes all evals in one pass.

Extract Feedback from Eval Transcripts

Read through eval reports (.harness/evals/sprint-NN-rR.md) looking for:

Tool calls that failed or produced unexpected results — the tool description may have been ambiguous or missing critical context
Criteria where the grader type was wrong: a criterion tagged behavioral whose evidence was actually file-reading, one tagged structural that in fact demanded execution, or one tagged as either type that genuinely required LLM judgment. Each case suggests the verification method in the contract was misspecified. Mislabeled behavioral criteria are the most common failure mode: they look strong on paper but accept structural evidence at grading time.
Evaluator misinterpretations — where the evaluator tested something different from what the criterion intended, often because the skill/agent description was unclear
Recurring failures across sprints — the same type of failure appearing in multiple sprints may indicate a systemic description gap rather than an implementation issue

Improvement Process

For each identified issue:

Locate the relevant tool/skill description — the agent markdown file, skill SKILL.md, or rubric that was involved
Propose a description change — make the description more specific, add context that was missing, or clarify ambiguous language. Follow ACI best practices: 3-4 sentences per tool description, meaningful names, explicit context about specialized terminology
Apply the change to the markdown file
Document the change in the summary under "Tool & Skill Description Improvements" with the rationale

Validation Against Held-Out Cases

To ensure improvements actually help rather than introducing new problems:

Identify held-out eval cases — select 2-3 prior sprint evals that were NOT used to derive the description changes
Re-evaluate mentally — would the updated descriptions have changed any grades in those held-out cases? Would they have prevented any misinterpretations?
Check for regressions — could the new wording cause false positives or false negatives on cases that previously graded correctly?

Only apply changes that improve held-out cases without causing regressions.
