agent-harness-kit

The infrastructure layer that makes AI agents production-ready.

Solo-dev harness engineering kit for Claude Code, with an experimental Codex-readable runtime surface. One command, ~30 minutes, and your hobby project gets the patterns that took OpenAI from prototype to 1M lines of agent-generated code: layered architecture, structural tests, garbage collection, review subagents, JSON feature tracking, and pre-completion checklists — without the enterprise overhead.

The Harness Engineering Shift

February 2026: OpenAI published "Harness engineering: leveraging Codex in an agent-first world" documenting how their Frontier Product Exploration team built an internal product with ~1 million lines of code over 5 months — with zero lines manually written by humans.

The results:

Team grew from 3 engineers to 7

~1,500 PRs merged (3.5 PRs per engineer per day)

Each engineer operating at 3-10x capacity through agent delegation

Agents running autonomously for 6+ hours per task

~1 billion tokens processed per day

The insight: The work shifted from writing code to engineering the harness — the infrastructure, constraints, and feedback loops that make agents reliable at scale.

March 2026: LangChain demonstrated this principle empirically. By improving their agent harness alone (no model changes), they jumped from 52.8% → 66.5% on Terminal-Bench 2.0, climbing 25 spots on the leaderboard.

The pattern is clear: Harness quality matters more than model choice for production outcomes.

Why This Kit Exists

You're a solo developer or small team. You don't have OpenAI's infrastructure budget or Stripe's agent platform team. But you can adopt the same patterns at hobby-project scale:

What you get:

Proven patterns from production harnesses — OpenAI's two-fold initializer/coding-agent split, Anthropic's CLAUDE.md table-of-contents approach, Mitchell Hashimoto's "engineer the harness" discipline

33 skills that codify rituals from teams shipping agent-generated code at scale (/add-feature, /context-query, /garbage-collection, /remember-project, /project-status, /review-this-pr, etc.)

10 read-only review subagents for cheap second-opinion passes and mandatory done-claim advice (advisor, architecture, security, reliability, performance, API consistency, trace failure, eval rubric, adapter compatibility, release readiness)

Structural enforcement via TypeScript, Python, Go, Rust, Swift, and Kotlin adapters — catch layer violations and high-risk shortcuts before they compound

Architecture fitness plugins — repo-local JSON rules for env, DB, provider, validation, and public-API boundaries with reviewer routing

Policy packs — stack-specific governance defaults for nextjs-saas, api-backend, and python-data

Cost guardrails and attribution — default budget plus provider-call cost by skill, task, and cache read/write bucket

Model routing evidence — lane-level model usage report so cheap explore lanes and stronger implementation/review lanes are measured, not guessed

Sanitized trace corpus — public success/failure traces for tiny, normal, high-risk, false-done, overbroad-edit, reviewer-gap, replay, bypass, and runtime-parity cases

JSON feature tracking (not Markdown) — Anthropic's pattern for machine-readable planning

Task contracts + evidence bundles — a feature can only move to passes: true when the current diff has machine-readable proof, concrete checks, and a diff summary

SQLite operational state — local harness.db records intake, stories, decisions, backlog, traces, friction, and trace quality without hand-editing Markdown tables

Context rules + trace scoring — phase-by-lane retrieval guidance plus minimal/standard/detailed trace quality gates for tiny, normal, and high-risk work

Orchestration contracts — multi-agent runs bind lanes, tool policy, required reviewers, task ids, and output artifacts to a checked workflow contract

Failure-to-rule records — every recurring agent miss can be captured as JSON and promoted into a durable harness prevention

Adversarial eval suite — deterministic red-team probes for fake evidence, missing high-risk attestation, protected-path bypasses, unsafe eval commands, unreviewed bypasses, and prompt-level hook bypass attempts

Pre-completion checklists — OpenAI's golden-principles garbage collection ritual, scaled to top-3 fixes per week
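To make the JSON feature tracking and evidence-bundle items above concrete, here is a minimal sketch of a gate that refuses to set passes: true unless a feature record carries proof, checks, and a diff summary. The record layout, the key names (`evidence`, `proof`, `checks`, `diff_summary`), and the helper functions are assumptions made for illustration; they are not the kit's actual schema or API.

```python
import json

# Hypothetical required keys for an evidence bundle; the kit's real schema may differ.
REQUIRED_EVIDENCE_KEYS = ("proof", "checks", "diff_summary")


def missing_evidence(feature: dict) -> list[str]:
    """Return the evidence keys that are absent or empty for one feature record."""
    evidence = feature.get("evidence") or {}
    return [key for key in REQUIRED_EVIDENCE_KEYS if not evidence.get(key)]


def mark_passing(feature: dict) -> dict:
    """Return a copy with passes=True, or raise if the evidence bundle is incomplete."""
    missing = missing_evidence(feature)
    if missing:
        raise ValueError(
            f"feature {feature.get('id')!r} lacks evidence: {', '.join(missing)}"
        )
    # Copy instead of mutating, so the caller decides when to persist the change.
    return {**feature, "passes": True}


if __name__ == "__main__":
    # Illustrative feature list; ids and check commands are made up.
    features = json.loads("""
    [
      {"id": "login-form", "passes": false,
       "evidence": {"proof": {"tests_passed": 12, "tests_failed": 0},
                    "checks": ["npm test", "npm run lint"],
                    "diff_summary": "Adds login form and validation"}},
      {"id": "password-reset", "passes": false, "evidence": {"checks": []}}
    ]
    """)
    for feature in features:
        try:
            updated = mark_passing(feature)
            print("passes:", updated["id"])
        except ValueError as err:
            print("blocked:", err)
```

Keeping the feature list in JSON rather than Markdown is what makes a check like this cheap: a script can load the file, test each record, and block a done-claim without parsing free text.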