From arcforge
Measures whether skills, agents, or workflows change AI agent behavior. Run baseline vs treatment trials, grade outputs, and track regressions before shipping changes.
Slash command: /arcforge:arc-evaluating
Measure whether skills, agents, and workflows actually change AI agent behavior. Define scenarios, prepare environments, run trials, grade results, track regressions.
agents/eval-analyzer.md
agents/eval-blind-comparator.md
agents/eval-grader.md
agents/skill-grader.md
dashboard/__tests__/eval-dashboard.test.js
dashboard/__tests__/ui-test-plan.md
dashboard/eval-dashboard-ui.html
dashboard/eval-dashboard.js
evals/evals.json
references/audit-workflow.md
references/cli-and-metrics.md
references/common-mistakes-catalog.md
references/eval-schemas.md
references/grading-and-execution.md
references/preflight.md
references/verdict-policy.md
Core principle: "Unit tests for AI agent behavior" — if you can't measure improvement, you can't ship with confidence.
Key distinction: You are evaluating AI agents (LLM + tools), not just LLM text output. Agents use tools, read files, search codebases. Your eval environment must account for this.
Eval is required when:
Not required when: the change has no behavioral footprint (reformatting, typos, metadata-only edits). When in doubt, run the eval — it is cheaper than shipping a regression.
Does skill X change agent behavior?
delta (improvement between baseline and treatment)
Does agent Y produce correct output?
pass@k (reliability across k trials)
Does the full toolkit system improve agent outcomes?
delta, pass^k for critical paths
Unlike skill evals (which vary the prompt), workflow evals vary the environment while keeping the identical prompt for both conditions.
Before designing any scenario, confirm scope:
| Question | Scope | What Varies | Primary Signal |
|---|---|---|---|
| Does this instruction change agent behavior? | skill | Skill present vs absent | delta |
| Can this agent complete the task correctly? | agent | Trial-to-trial execution | pass@k, pass^k |
| Does the toolkit improve outcomes? | workflow | Bare agent vs full toolkit | delta, pass^k |
| Does this component work correctly? | none | N/A | Use unit/integration tests |
Do NOT proceed to scenario design until you can answer question 2 in one sentence.
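The three signals in the table can be sketched as follows. This is a minimal illustration, assuming the common definitions: pass@k as the chance that at least one of k trials passes, and pass^k as the chance that all k pass, both estimated from an observed per-trial pass rate. The function names are hypothetical; the harness's actual formulas are documented in references/cli-and-metrics.md.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of trials that passed."""
    if not results:
        raise ValueError("need at least one trial")
    return sum(results) / len(results)


def pass_at_k(p: float, k: int) -> float:
    """Assumed definition: probability that at least one of k independent trials passes."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Assumed definition: probability that all k independent trials pass."""
    return p ** k


def delta(baseline: list[bool], treatment: list[bool]) -> float:
    """Treatment pass rate minus baseline pass rate."""
    return pass_rate(treatment) - pass_rate(baseline)


# Illustrative trial outcomes only, not real eval data.
baseline = [True, False, False, True, False]   # 2 of 5 pass
treatment = [True, True, True, False, True]    # 4 of 5 pass
p = pass_rate(treatment)
print(f"delta={delta(baseline, treatment):+.2f}")
print(f"pass@3={pass_at_k(p, 3):.3f}  pass^3={pass_pow_k(p, 3):.3f}")
```

Under these definitions pass^k can never exceed the per-trial pass rate and shrinks as k grows, which is one reason it suits critical paths where every run has to succeed.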
1. Preflight → validate scenario is still discriminative
2. Define eval → scenario + assertions + grader type
3. Prepare env → set up the trial environment (files, tools, context)
4. Run eval → spawn agent with scenario, capture transcript
5. Grade eval → code grader, model grader, or human grader
6. Track results → pass@k metric over time (JSONL)
7. Report → SHIP / NEEDS WORK / BLOCKED / INSUFFICIENT_DATA
REQUIRED BACKGROUND: references/preflight.md — ceiling threshold (0.8), PASS/BLOCK semantics, scenario hash mechanics.
REQUIRED BACKGROUND: references/verdict-policy.md — full verdict enum (SHIP, NEEDS WORK, BLOCKED, IMPROVED, REGRESSED, NO_CHANGE, INSUFFICIENT_DATA), why k<5 triggers INSUFFICIENT_DATA, asymmetric delta thresholds.
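The ceiling check in step 1 can be pictured roughly like this. It is a sketch under the assumption that preflight compares a pilot baseline pass rate against the 0.8 ceiling and blocks when the baseline exceeds it. `preflight` and its signature are hypothetical; the real PASS/BLOCK semantics and the scenario hash mechanics are in references/preflight.md.

```python
CEILING = 0.8  # ceiling threshold named in references/preflight.md


def preflight(baseline_results: list[bool], ceiling: float = CEILING) -> str:
    """Return "BLOCK" when the baseline already passes too often to show a delta."""
    if not baseline_results:
        raise ValueError("run at least one pilot trial first")
    rate = sum(baseline_results) / len(baseline_results)
    # Assumption: strictly above the ceiling blocks; at or below it passes.
    return "BLOCK" if rate > ceiling else "PASS"


print(preflight([True, True, True]))     # baseline 3/3: no headroom for the treatment
print(preflight([True, False, False]))   # baseline 1/3: the treatment can still improve
```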
Before writing assertions, complete this checklist:
Scenario validity rules:
arc eval ab owns the A/B loop: it runs the same single-condition scenario twice.
See references/grading-and-execution.md for environment setup, trial execution, isolation mechanics, and result tracking. See references/cli-and-metrics.md for CLI commands, metrics, and the scenario template.
Three graders: code (deterministic checks), model (intent/quality/reasoning), human (audience-dependent taste or domain expertise). Match grader to assertion nature — not convenience. For discipline-skill compliance, agents/skill-grader.md also extracts and classifies rationalizations.
Grader selection principle: Structured output (JSON, typed fields) does not make semantic quality deterministic. An agent can return valid JSON while producing poor analysis. Code-grade structure; model-grade quality.
Model/human grader calibration: One vague model-grader preference is not release evidence. For semantic release claims, use a task-derived rubric with anchors, repeated trials, CI/variance/agreement checks, and blind comparison, human spot-check, or independent adjudication. Treat model-grader output as noisy semantic evidence, not deterministic proof.
Deterministic proxy warning: Keyword, regex, and JSON-schema checks cover facts/fields, not critique quality. If a proxy can pass a shallow or adversarial answer, tighten it with negative fixtures/traps or model/human-grade the quality claim.
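To make the proxy warning concrete, here is a small sketch of a keyword-based code grader and a negative fixture that exposes it. Everything in it (the grader functions, the keyword list, the sample answers) is hypothetical and exists only to illustrate the failure mode.

```python
import re

REQUIRED_TERMS = ("race condition", "lock")  # hypothetical rubric keywords


def keyword_grader(answer: str) -> bool:
    """Weak proxy: passes if every required term appears anywhere in the answer."""
    text = answer.lower()
    return all(term in text for term in REQUIRED_TERMS)


def tightened_grader(answer: str) -> bool:
    """Same proxy plus a trap: reject answers that dismiss the issue."""
    dismissive = re.search(r"\bno (race condition|lock)", answer.lower())
    return keyword_grader(answer) and dismissive is None


good = "The counter update has a race condition; guard it with a lock."
shallow = "There is no race condition here and no lock is needed."  # negative fixture

# The plain keyword proxy cannot tell the two answers apart.
assert keyword_grader(good) and keyword_grader(shallow)
# The trap rejects the shallow answer while still accepting the good one.
assert tightened_grader(good) and not tightened_grader(shallow)
```

Even the tightened check only verifies that facts are present and not contradicted. Whether the critique is any good remains a model-graded or human-graded question.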
Report behavior separately from operational cost. A treatment can be correct but slower, more verbose, or pricier. Preserve duration/token/cost deltas when available, and do not hide operational regressions behind a passing behavioral verdict.
| Verdict | Meaning |
|---|---|
| SHIP | Code-graded: pass rate = 100%. Model-graded: CI95 lower bound ≥ 0.8 |
| NEEDS WORK | 60% ≤ pass rate < SHIP threshold |
| BLOCKED | pass rate < 60% |
| INSUFFICIENT_DATA | k < 5 — CI95 cannot be computed. Run more trials. |
Full verdict semantics in references/verdict-policy.md.
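The table maps onto a small decision function. The sketch below is illustrative only: `verdict` and its parameters are hypothetical, the CI95 lower bound is passed in rather than computed (the interval method is not described here), and applying the k < 5 rule to code-graded runs as well as model-graded ones is an assumption. Human-graded runs are left out.

```python
from typing import Optional

MIN_TRIALS = 5          # below this the table reports INSUFFICIENT_DATA
SHIP_CI_LOWER = 0.8     # model-graded SHIP bar on the CI95 lower bound
NEEDS_WORK_FLOOR = 0.6  # below this pass rate the verdict is BLOCKED


def verdict(passes: int, k: int, grader: str,
            ci95_lower: Optional[float] = None) -> str:
    """Map trial counts to SHIP / NEEDS WORK / BLOCKED / INSUFFICIENT_DATA."""
    if k < MIN_TRIALS:
        return "INSUFFICIENT_DATA"
    rate = passes / k
    if grader == "code":
        ship = passes == k  # code-graded: pass rate must be 100%
    elif grader == "model":
        if ci95_lower is None:
            raise ValueError("model-graded verdicts need a CI95 lower bound")
        ship = ci95_lower >= SHIP_CI_LOWER
    else:
        raise ValueError(f"unsupported grader type: {grader}")
    if ship:
        return "SHIP"
    return "NEEDS WORK" if rate >= NEEDS_WORK_FLOOR else "BLOCKED"


print(verdict(5, 5, "code"))          # every trial passed
print(verdict(4, 5, "code"))          # 80% pass rate, below the code-graded SHIP bar
print(verdict(4, 4, "code"))          # only 4 trials
print(verdict(9, 10, "model", 0.82))  # lower bound clears 0.8
```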
When pressure builds to skip or shortcut eval, these rationalizations surface. Each is a blocker in disguise.
| Excuse | Reality |
|---|---|
| "This change is too small to eval" | Size does not predict behavioral impact. A one-line prompt change can flip a verdict. Run eval — it takes minutes. |
| "Time pressure, ship now and eval later" | Eval done after shipping is a postmortem, not a gate. Ship with evidence or do not ship. |
| "Preflight blocks — I'll skip it just this once" | Preflight blocked because the scenario is no longer discriminative. Bypassing it means you cannot measure anything. Redesign the scenario. |
| "k=4 is close enough to 5" | The CI95 requires k ≥ 5 to be statistically meaningful. k=4 produces INSUFFICIENT_DATA. Run one more trial. |
| "INSUFFICIENT_DATA is advisory — I'll ship anyway" | INSUFFICIENT_DATA means you have no valid statistical basis for a verdict. Shipping on INSUFFICIENT_DATA is shipping blind. |
| "The grader raised weak_assertions but the pass rate is fine" | weak_assertions signal the assertions are not testing the right thing. A passing score on a poorly designed assertion proves nothing. Redesign the assertion. |
REQUIRED BACKGROUND: references/audit-workflow.md — how promotion and retirement arbitration works for discovered_claims and weak_assertions.
Every listed thought means stop, re-read the skill, do not proceed.
Top mistakes that waste the most eval runs. Full catalog in references/common-mistakes-catalog.md.
| Mistake | What Happens | Fix |
|---|---|---|
| Scenario before question | Mixing adherence, correctness, and toolkit effects in one noisy test | State the question first: behavior change, task outcome, or toolkit effect |
| Baseline already near ceiling | Both conditions pass, delta stays tiny | Run 2-3 pilot trials first; if baseline exceeds ~0.8, redesign |
| Skill formalizes behavior agent already exhibits | A/B delta is zero — behavior is generic competence, not skill-specific | Ask "would baseline behave differently without this skill?" If no, use workflow or agent eval |
| Prompt leaks the repair pattern | Baseline follows the template and scores high without the skill | Remove explicit grader split or named repair structure from the prompt |
| Code-grading skill adherence via competence proxy | Both conditions pass, delta is zero | Mentally run the code grader against a bare agent — if it still passes, the artifact isn't discriminative |
| Using --skill-file for workflow eval | Varies the prompt instead of the environment | Workflow A/B varies the environment — use eval ab <name> without --skill-file |
| Workflow eval with no plugins installed | Baseline and treatment are identical, delta is always 0 | Ensure toolkit plugin is installed: claude plugin list should show active plugins |
Before:
After:
evals/benchmarks/latest.json
Numeric vs qualitative analysis: Numeric comparison (delta, CI, verdict) is programmatic; the harness computes it. The eval-analyzer agent adds qualitative analysis for model/human-graded A/B results; it does not replace the programmatic verdict.
Reference files:
npx claudepluginhub gregoryho/arcforge --plugin arcforge