From trine-eval
Import real failure cases from bug reports, incidents, and manual tests to seed the eval suite
Slash command
/trine-eval:bootstrap-failures
Seed the eval suite with real failure cases instead of starting from scratch. This implements Steps 0 and 1 of Anthropic's eval methodology: "Start early with 20-50 real failure cases" and "Start with what you already test manually."
Synthetic test cases miss the failure modes that matter most. Real failures — drawn from production incidents, bug reports, and manual QA — represent the actual distribution of problems your system encounters. Starting with these gives the eval suite immediate relevance and catches the issues users care about.
Gather failure cases from these sources, prioritized by user impact:
For each failure case, create an eval task entry in .harness/bootstrap/failure-catalog.json:
{
  "failures": [
    {
      "id": "F001",
      "source": "bug-report",
      "source_ref": "GH-123",
      "title": "Short description of the failure",
      "problem": "What went wrong — the observed behavior",
      "expected": "What should have happened — the correct behavior",
      "success_criteria": "Specific, unambiguous criterion for verifying the fix",
      "reference_solution": "Optional: known-working output or approach",
      "rubric_dimension": "Which rubric dimension this maps to",
      "severity": "critical | high | medium | low",
      "grader_type": "behavioral | structural | llm-judge"
    }
  ]
}
Choose grader_type as follows: behavioral if the criterion is verified by invoking the artifact and observing the result (preferred); structural if it is verified by inspecting an artifact at rest (grep, jq, schema check); llm-judge if it requires reading comprehension or subjective assessment. Default to behavioral whenever the artifact can be executed.
Start with 20-50 real failure cases. This is sufficient for early development, where each system change produces large, noticeable effects. As the system matures, grow the catalog organically from ongoing production feedback.
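The schema and the allowed severity and grader_type values above can be checked mechanically before the catalog is handed to the harness. The following is a minimal sketch, not part of trine-eval: the function names validate_entry and validate_catalog are hypothetical, and treating every field except reference_solution as required is an assumption based on the "Optional:" note in the schema.

```python
# Field names and allowed values are copied from the schema above.
# Assumption: every field except reference_solution is required.
REQUIRED_FIELDS = (
    "id", "source", "source_ref", "title", "problem", "expected",
    "success_criteria", "rubric_dimension", "severity", "grader_type",
)
SEVERITIES = {"critical", "high", "medium", "low"}
GRADER_TYPES = {"behavioral", "structural", "llm-judge"}


def validate_entry(entry):
    """Return a list of human-readable problems found in one catalog entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Only report an unknown value when the field is present and non-empty,
    # so a missing field is not reported twice.
    for field, allowed in (("severity", SEVERITIES), ("grader_type", GRADER_TYPES)):
        value = entry.get(field)
        if isinstance(value, str) and value.strip() and value not in allowed:
            problems.append(f"unknown {field}: {value!r}")
    return problems


def validate_catalog(catalog):
    """Map each problematic entry id to its problems, including duplicate ids."""
    report = {}
    seen = set()
    for index, entry in enumerate(catalog.get("failures", [])):
        entry_id = entry.get("id") or f"<entry {index}>"
        problems = validate_entry(entry)
        if entry_id in seen:
            problems.append("duplicate id")
        seen.add(entry_id)
        if problems:
            report[entry_id] = problems
    return report
```

An empty report from validate_catalog means every entry has the required fields and uses only the severity and grader_type values listed in the schema.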
A 0% pass rate across many trials almost always signals a broken task, not an incapable agent — if nothing passes, review the task definitions before concluding the system is broken.
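The 0% heuristic can be applied automatically to trial results. This sketch assumes a hypothetical result shape (a mapping from task id to a list of pass/fail booleans, one per trial) and a hypothetical min_trials threshold; neither comes from trine-eval.

```python
def find_suspect_tasks(trial_results, min_trials=5):
    """Return ids of tasks that never passed across at least min_trials trials.

    trial_results maps a task id to a list of booleans, one per trial
    (True means the trial passed). Both this shape and the min_trials
    default are assumptions made for illustration.
    """
    suspects = []
    for task_id, outcomes in trial_results.items():
        # Too few trials is not enough evidence; any pass clears the task.
        if len(outcomes) >= min_trials and not any(outcomes):
            suspects.append(task_id)
    return sorted(suspects)
```

Tasks returned by this function are candidates for a task-definition review, not evidence that the system under test is incapable.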
The failure catalog feeds into the existing harness workflow at two points:
1. Kickoff (.harness/ initialization). When /harness-kickoff runs, if .harness/bootstrap/failure-catalog.json exists:
- Failures with severity critical should appear as sprint criteria in the first sprint
2. Contract proposal. When the Generator proposes sprint contracts:
- Criteria come from catalog entries (the success_criteria field maps directly)
- Include the reference_solution from the catalog in the contract's Reference Solutions section
Real failures (bugs, incidents, tickets)
↓ bootstrap skill (manual import)
.harness/bootstrap/failure-catalog.json
↓ kickoff reads catalog
.harness/spec.md (informed by failure patterns)
↓ planner decomposes sprints
.harness/sprints.json (prioritized by severity)
↓ generator reads catalog during contract proposal
.harness/contracts/sprint-NN.md (criteria from real failures)
↓ evaluator tests criteria
.harness/evals/sprint-NN-rR.md (real failure cases as eval tasks)
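The severity-based prioritization in the flow above can be sketched as a stable sort over catalog entries. The helper names prioritize and first_sprint_criteria are hypothetical, and this is only an illustration of the ordering, not how the planner is implemented.

```python
# Severity values are taken from the catalog schema; lower rank sorts first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def prioritize(failures):
    """Return failures sorted most severe first.

    sorted() is stable, so entries of equal severity keep their catalog
    order. Entries with an unknown or missing severity sort last.
    """
    return sorted(
        failures,
        key=lambda f: SEVERITY_ORDER.get(f.get("severity"), len(SEVERITY_ORDER)),
    )


def first_sprint_criteria(failures):
    """Collect the success_criteria of critical failures for the first sprint."""
    return [
        f["success_criteria"]
        for f in prioritize(failures)
        if f.get("severity") == "critical"
    ]
```

Because success_criteria maps directly onto contract criteria, the list returned by first_sprint_criteria can be used as a starting point for the first sprint's contract.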
The templates/by-rubric/ subdirectory contains pre-seeded failure catalogs derived from
rubric playbook traps. Each file is named after the rubric it targets (e.g.,
templates/by-rubric/harness-build.json) and contains a failures array following the
same schema as .harness/bootstrap/failure-catalog.json.
Per-rubric templates provide a curated starting point for projects graded by a specific rubric. Rather than starting from zero failure cases, a harness-build project can begin with 12–15 trap-derived entries that cover all rubric dimensions — including the three UNCONDITIONAL gate dimensions (Control Plane & Agentic Loop, Tool Registry & Sandboxing, Governance & Human Oversight) that carry the highest risk weight and cause automatic sprint FAIL if absent.
When /harness-kickoff runs for a project and a matching per-rubric template exists, the
merge is triggered during Step 2b (failure catalog seeding). The procedure is:
1. Determine the rubric the project is graded by (e.g., harness-build).
2. Look for templates/by-rubric/<rubric-name>.json within the
bootstrap-failures skill directory. For a harness-build project, this is
templates/by-rubric/harness-build.json.
3. Read the failures array from the template file.
4. Read the project catalog at .harness/bootstrap/failure-catalog.json
if it exists. If no project catalog exists yet, proceed to step 6.
5. Collect the set of id values already present in the project catalog.
6. For each entry in the template's failures array: if the entry's
id is not already in the existing-ID set, append the entry to the project catalog. Entries
whose id is already present are skipped and not overwritten.
7. Write the merged failures array back to
.harness/bootstrap/failure-catalog.json. If no project catalog existed, this write creates
the file with the template entries as the initial catalog.
Additive-merge-by-id rule: per-rubric template entries do not overwrite user-authored entries.
Any entry with an id already present in the project catalog is skipped. This ensures the merge
is idempotent — running kickoff a second time on the same project does not duplicate entries or
overwrite changes the practitioner made to the catalog after the initial seeding.
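The additive-merge-by-id rule can be illustrated with a short sketch. The function name merge_template is hypothetical and this is not the skill's actual implementation; it only demonstrates the behavior described above, assuming both files hold a top-level failures array.

```python
import json
from pathlib import Path


def merge_template(template_path, catalog_path):
    """Additively merge template failures into the project catalog by id.

    Returns the number of entries appended. Entries whose id already
    exists in the project catalog are skipped, never overwritten.
    """
    template = json.loads(Path(template_path).read_text(encoding="utf-8"))
    catalog_file = Path(catalog_path)
    if catalog_file.exists():
        catalog = json.loads(catalog_file.read_text(encoding="utf-8"))
    else:
        # No project catalog yet: start from an empty failures array.
        catalog = {"failures": []}

    failures = catalog.setdefault("failures", [])
    existing_ids = {entry["id"] for entry in failures}

    added = 0
    for entry in template.get("failures", []):
        if entry["id"] not in existing_ids:
            failures.append(entry)
            existing_ids.add(entry["id"])
            added += 1

    # Writing creates the catalog (and its directory) when it did not exist.
    catalog_file.parent.mkdir(parents=True, exist_ok=True)
    catalog_file.write_text(json.dumps(catalog, indent=2) + "\n", encoding="utf-8")
    return added
```

Running the function twice against the same files appends nothing the second time, which is the idempotence property the rule is meant to guarantee.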
by-rubric/harness-build.json — 12–15 playbook-trap-derived entries for agent runtime harnesses,
covering all 7 harness-build rubric dimensions with at least 2 entries each for the three gate
dimensions.
To bootstrap a project's eval suite:
1. Create the .harness/bootstrap/ directory
2. Create failure-catalog.json following the schema
3. Run /harness-kickoff; the planner will incorporate the catalog
The bootstrap is a one-time seeding operation, but the catalog should be updated as new production failures are discovered. It is a living document, not a snapshot.
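Keeping the catalog current can be as simple as appending an entry each time a new production failure is triaged. The sketch below is one possible helper, not part of trine-eval: append_failure and next_failure_id are hypothetical names, and the zero-padded F-number id scheme is an assumption extrapolated from the single F001 example in the schema.

```python
import json
import re
from pathlib import Path

# Catalog location as given in this document.
CATALOG_PATH = Path(".harness/bootstrap/failure-catalog.json")


def next_failure_id(failures):
    """Return the next id in the assumed F001, F002, ... sequence."""
    numbers = [
        int(match.group(1))
        for f in failures
        if (match := re.fullmatch(r"F(\d+)", str(f.get("id", ""))))
    ]
    return f"F{max(numbers, default=0) + 1:03d}"


def append_failure(entry, catalog_path=CATALOG_PATH):
    """Append one failure to the catalog, creating the file if needed.

    A fresh id is always assigned, replacing any id in the given entry.
    Returns the assigned id.
    """
    path = Path(catalog_path)
    if path.exists():
        catalog = json.loads(path.read_text(encoding="utf-8"))
    else:
        catalog = {"failures": []}

    failures = catalog.setdefault("failures", [])
    new_id = next_failure_id(failures)
    failures.append({**entry, "id": new_id})

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(catalog, indent=2) + "\n", encoding="utf-8")
    return new_id
```

Ids that do not match the assumed pattern, such as template-derived ids with a different prefix, are ignored when computing the next number, so they are neither renumbered nor overwritten.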
npx claudepluginhub ats-kinoshita-iso/trine-eval