Skill

agent-eval-lab

Improve agent eval scores using a structured benchmark loop. Use when eval scenarios fail, judge scores are low, onboarding quality regresses, or when iterating on system prompts, persona overlays, extraction prompts, or onboarding instructions. Covers both onboarding evals (scripted messages, 6 dimensions) and agent live evals (user-sim LLM, mission-driven criteria).

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/noticed-labs:agent-eval-lab

User invocable

Model invocable

Inline context

Default effort

Supporting Files

EXPERIMENTS.md, PATTERNS.md, WORKFLOW.md

SKILL.md

217 lines · ~1.9k tokens

Stats

Stars: 0

Maintenance: Good

Last Commit: Apr 27, 2026

Agent Eval Lab

Use this skill to systematically improve agent eval scores. Same discipline as query-optimization-lab: measure, hypothesize, change one thing, re-measure, keep/discard.

THE JUDGE IS IMMUTABLE. You improve the agent's behavior, not the grading. Never modify the judge rubric, scoring thresholds, or assertion helpers to make scores go up.

Two Eval Types

This lab covers two complementary eval systems:

Onboarding Evals (scripted)

User messages: Pre-scripted in fixture files
Judge: 6 fixed dimensions (naturalness, progression, factExtraction, personaConsistency, toolUse, accountAwareness), scored 1-5
Output: CSV reports
Use for: Regression testing onboarding quality with deterministic inputs

Agent Live Evals (autonomous)

User messages: Generated by a user-simulator LLM that roleplays a fixture profile (e.g., Sarah Chen)
Judge: Mission-driven with custom success criteria, per-criterion pass/fail, scored 0-10
Output: JSON reports + console output
Use for: Testing full agent behavior with realistic, dynamic conversations

Core Loop

1. Read the latest eval output: find the weakest scenario (lowest score or most failed criteria)
2. Read the judge issues: understand exactly what the judge flagged
3. Trace to source: find the file responsible (see file map in WORKFLOW.md)
4. Form one hypothesis: one sentence stating what's wrong and why
5. Make one change: edit exactly one file
6. Re-run that single scenario
7. Compare before/after: did scores improve? did any dimension/criterion regress?
8. Keep/discard: keep if improved and no regressions; discard and revert otherwise
9. Once the scenario passes, run the full suite to confirm no regressions
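
The first step of the loop, choosing which scenario to work on, can be sketched as a small helper. This is a minimal sketch, not code from the repo: the `ScenarioResult` shape and the `pickWeakest` name are hypothetical. Ordering by lowest score first and using failed criteria as the tie-breaker is an assumption, since the skill only says "lowest score or most failed criteria".

```typescript
// Hypothetical shape for one parsed eval result (not the repo's actual type).
interface ScenarioResult {
  name: string;
  score: number; // judge_score (1-5) or overallScore (0-10)
  failedCriteria: number; // number of failed criteria or flagged dimensions
}

// Return the weakest scenario: lowest score first, most failures as tie-breaker.
// Returns undefined when there are no results to inspect.
function pickWeakest(results: ScenarioResult[]): ScenarioResult | undefined {
  const sorted = results
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.score - b.score || b.failedCriteria - a.failedCriteria);
  return sorted[0];
}
```

Iterating on one scenario at a time keeps each re-run fast and makes it clear which change caused which score movement.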

Repo Defaults

Onboarding evals

# run all onboarding evals
npm run eval:live

# run single scenario (fast iteration)
npm run eval:live -- <scenario-name>

Eval files: apps/noticed-agent/__tests__/evals/onboarding/
Helpers: apps/noticed-agent/__tests__/evals/helpers/
CSV reports: apps/noticed-agent/scripts/benchmark-reports/onboarding-eval-<timestamp>.csv

Agent live evals

# run all agent live evals
npm run eval:agent

# with verbose transcript output
EVAL_VERBOSE=1 npm run eval:agent

Eval files: apps/noticed-agent/__tests__/evals/agent-eval/
Missions: apps/noticed-agent/__tests__/evals/agent-eval/missions/
Fixtures: apps/noticed-agent/__tests__/evals/agent-eval/fixtures/
Runner: apps/noticed-agent/__tests__/evals/agent-eval/runner.ts
User simulator: apps/noticed-agent/__tests__/evals/agent-eval/user-simulator.ts
Judge: apps/noticed-agent/__tests__/evals/agent-eval/judge.ts

Common

Vitest config: apps/noticed-agent/vitest.live.config.ts
Requires: ANTHROPIC_API_KEY set in environment or apps/noticed-agent/.env

Metrics

Onboarding evals

Primary: judge_score (1-5 overall)

Dimensions: naturalness, progression, factExtraction, personaConsistency, toolUse, accountAwareness

Pass thresholds:

overall score >= 3.5
no single dimension <= 2
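
The two pass thresholds above can be expressed as a single predicate. A minimal sketch: the dimension names come from this skill, while the `DimensionScores` type and `passesOnboarding` function are hypothetical names, and the real assertion helpers in the repo may be structured differently.

```typescript
// Dimension names as listed in this skill; each is scored 1-5 by the judge.
type Dimension =
  | "naturalness"
  | "progression"
  | "factExtraction"
  | "personaConsistency"
  | "toolUse"
  | "accountAwareness";

type DimensionScores = Record<Dimension, number>;

// Hypothetical helper: true when overall >= 3.5 and no dimension is <= 2.
function passesOnboarding(overall: number, dims: DimensionScores): boolean {
  const noWeakDimension = (Object.keys(dims) as Dimension[]).every(
    (d) => dims[d] > 2
  );
  return overall >= 3.5 && noWeakDimension;
}
```

Note that a high overall score does not rescue a scenario: a single dimension at 2 or below fails it on its own.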

Correctness checks:

completed_steps — must match expected completion for the scenario
remaining_steps — only steps genuinely blocked by the scenario
persona — must match user's selection

Expected completion per scenario type

scenario pattern	expected completed	acceptable remaining
no accounts connected	persona, welcome, identity, profile, preferences	github_link, linkedin_connect, linkedin_export
github connected	persona, welcome, identity, profile, preferences, github_link	linkedin_connect, linkedin_export
linkedin connected	persona, welcome, identity, profile, preferences, linkedin_connect	github_link, linkedin_export
both connected	persona, welcome, identity, profile, preferences, github_link, linkedin_connect	linkedin_export
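
The table above follows a simple rule: the five account-independent steps are always expected, each connected account moves its step from "remaining" to "completed", and `linkedin_export` is acceptable as remaining in every row. A minimal sketch of that rule, where the step names come from the table and `Connected` and `expectedCompletion` are hypothetical names:

```typescript
// Steps expected in every scenario, regardless of connected accounts.
const BASE_STEPS = ["persona", "welcome", "identity", "profile", "preferences"];

// Hypothetical description of a scenario's starting state.
interface Connected {
  github: boolean;
  linkedin: boolean;
}

// Derive expected completed and acceptable remaining steps for a scenario.
function expectedCompletion(c: Connected): {
  completed: string[];
  remaining: string[];
} {
  const completed = BASE_STEPS.slice();
  const remaining: string[] = [];
  (c.github ? completed : remaining).push("github_link");
  (c.linkedin ? completed : remaining).push("linkedin_connect");
  remaining.push("linkedin_export"); // acceptable as remaining in every row
  return { completed, remaining };
}
```

For example, `expectedCompletion({ github: true, linkedin: false })` yields the "github connected" row of the table.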

Agent live evals

Primary: overallScore (0-10)

Criteria: Per-mission success criteria, each scored pass/fail with reasoning

Pass thresholds:

overall score >= 5
at least 50% of criteria pass
at least 2 onboarding steps completed
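
The three thresholds above combine with a logical AND. A minimal sketch, assuming a report object with `overallScore`, `criteriaResults`, and `completedSteps` fields as named in this skill. The inner fields of `CriterionResult`, the `passesAgentLive` name, and the choice to fail a report with zero criteria are assumptions, not the repo's actual judge output.

```typescript
// Hypothetical shape of one judged criterion.
interface CriterionResult {
  criterion: string;
  passed: boolean;
  reasoning: string;
}

// Hypothetical subset of an agent live eval report.
interface AgentEvalResult {
  overallScore: number; // 0-10
  criteriaResults: CriterionResult[];
  completedSteps: string[];
}

// True when score >= 5, at least 50% of criteria pass, and >= 2 steps completed.
function passesAgentLive(r: AgentEvalResult): boolean {
  const total = r.criteriaResults.length;
  const passed = r.criteriaResults.filter((c) => c.passed).length;
  // Assumption: a report with no criteria cannot satisfy the 50% rule.
  const passRate = total === 0 ? 0 : passed / total;
  return r.overallScore >= 5 && passRate >= 0.5 && r.completedSteps.length >= 2;
}
```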

Correctness checks:

criteriaResults — each criterion has pass/fail + reasoning
completedSteps — at least the expected steps for the mission's starting state
finalPersona — a persona was selected
qualitativeNotes — strengths/weaknesses/unexpected behaviors logged

Key differences from onboarding evals

No fixed dimensions — criteria are defined per mission
User messages are non-deterministic (LLM-generated), so higher variance
Conversation length varies (up to turnBudget, ~20-25 turns)
Tests the full pipeline including tool dispatch and workspace writes

Keep/Discard Rules

Keep when:

target scenario's score improves
no dimension/criterion that was previously passing now fails
full suite re-run shows no other scenario regressed
completion checks still pass
improvement holds on 2 consecutive runs

Discard when:

score doesn't improve or gets worse
another scenario regresses
fix is cosmetic (cleaner prompt but same score)
score only improves on one lucky run
completed_steps changed unexpectedly
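
The keep and discard rules above can be read as one decision function: keep only when every keep condition holds, discard otherwise. A minimal sketch, where the `IterationEvidence` shape and `decide` name are hypothetical, and "improves" is interpreted as strictly higher than the before score on the last two consecutive re-runs:

```typescript
// Hypothetical record of what was observed for one change.
interface IterationEvidence {
  scoreBefore: number;
  scoreAfterRuns: number[]; // target-scenario score on each consecutive re-run
  newlyFailing: string[]; // dimensions/criteria that passed before and fail now
  regressedScenarios: string[]; // other scenarios that got worse in the full suite
  completionChecksPass: boolean;
}

function decide(e: IterationEvidence): "keep" | "discard" {
  // A single lucky run is not enough: require two consecutive improved runs.
  const improvedTwice =
    e.scoreAfterRuns.length >= 2 &&
    e.scoreAfterRuns.slice(-2).every((s) => s > e.scoreBefore);
  const noRegressions =
    e.newlyFailing.length === 0 &&
    e.regressedScenarios.length === 0 &&
    e.completionChecksPass;
  return improvedTwice && noRegressions ? "keep" : "discard";
}
```

A cosmetic change that leaves the score unchanged falls out as "discard" automatically, because the strict comparison requires a real improvement.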

Flakiness Protocol

LLM evals are non-deterministic. Agent live evals have higher variance than onboarding evals because both the user messages and agent responses are LLM-generated.

reliably passing: passes on 2 out of 2 consecutive runs
reliably failing: fails on 2 out of 2 consecutive runs
mixed: run a third time — majority wins
never ship a fix based on a single passing run after a failure
for agent live evals, expect more variance — a score swing of +/- 1.5 is normal; focus on criterion pass/fail stability, not exact scores
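
The two-run rule with a third-run tie-break can be sketched as follows. This is an illustrative helper with hypothetical names (`Verdict`, `classifyRuns`); it takes the pass/fail outcome of consecutive runs in order and never returns a verdict from a single run.

```typescript
type Verdict = "pass" | "fail" | "run-again";

// runs[i] is true when run i passed. Only the first three runs are considered.
function classifyRuns(runs: boolean[]): Verdict {
  if (runs.length < 2) return "run-again"; // never decide from a single run
  if (runs[0] === runs[1]) return runs[0] ? "pass" : "fail"; // 2 out of 2 agree
  if (runs.length < 3) return "run-again"; // mixed: a third run is needed
  // With one pass and one fail so far, the third run decides the majority.
  return runs[2] ? "pass" : "fail";
}
```

For example, `classifyRuns([false, true])` returns `"run-again"`, and `classifyRuns([false, true, true])` returns `"pass"`.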

Output Format

After each iteration, log:

For onboarding evals

## Scenario
<which eval scenario was targeted>

## Judge Issues
<verbatim issues from the CSV>

## Hypothesis
<one sentence>

## Change
<file edited and what changed>

## Benchmark
- before: judge_score=X | dim1=X, dim2=X, ...
- after:  judge_score=X | dim1=X, dim2=X, ...
- delta:  +/-X overall | dimension deltas
- completion: N/M steps -> N/M steps

## Decision
keep | discard

## Next Step
<run full suite | iterate on same scenario | move to next weakest>

For agent live evals

## Mission
<mission id>

## Failed Criteria
<list of criteria that failed with judge reasoning>

## Hypothesis
<one sentence>

## Change
<file edited and what changed>

## Benchmark
- before: overallScore=X | N/M criteria passed | steps: [list]
- after:  overallScore=X | N/M criteria passed | steps: [list]
- weaknesses before: ...
- weaknesses after: ...

## Decision
keep | discard

## Next Step
<run full suite | iterate on same mission | add new mission>

References

WORKFLOW.md — detailed step-by-step with file map
PATTERNS.md — proven prompt/extraction fixes
EXPERIMENTS.md — discipline for LLM non-determinism
