From noticed-labs
Improve agent eval scores using a structured benchmark loop. Use when eval scenarios fail, judge scores are low, onboarding quality regresses, or when iterating on system prompts, persona overlays, extraction prompts, or onboarding instructions. Covers both onboarding evals (scripted messages, 6 dimensions) and agent live evals (user-sim LLM, mission-driven criteria).
Slash command: /noticed-labs:agent-eval-lab
Use this skill to systematically improve agent eval scores. Same discipline as query-optimization-lab: measure, hypothesize, change one thing, re-measure, keep/discard.
THE JUDGE IS IMMUTABLE. You improve the agent's behavior, not the grading. Never modify the judge rubric, scoring thresholds, or assertion helpers to make scores go up.
This lab covers two complementary eval systems:
# run all onboarding evals
npm run eval:live
# run single scenario (fast iteration)
npm run eval:live -- <scenario-name>
- apps/noticed-agent/__tests__/evals/onboarding/
- apps/noticed-agent/__tests__/evals/helpers/
- apps/noticed-agent/scripts/benchmark-reports/onboarding-eval-<timestamp>.csv

# run all agent live evals
npm run eval:agent
# with verbose transcript output
EVAL_VERBOSE=1 npm run eval:agent
- apps/noticed-agent/__tests__/evals/agent-eval/
- apps/noticed-agent/__tests__/evals/agent-eval/missions/
- apps/noticed-agent/__tests__/evals/agent-eval/fixtures/
- apps/noticed-agent/__tests__/evals/agent-eval/runner.ts
- apps/noticed-agent/__tests__/evals/agent-eval/user-simulator.ts
- apps/noticed-agent/__tests__/evals/agent-eval/judge.ts
- apps/noticed-agent/vitest.live.config.ts
- ANTHROPIC_API_KEY set in environment or apps/noticed-agent/.env

Primary: judge_score (1-5 overall)
Dimensions: naturalness, progression, factExtraction, personaConsistency, toolUse, accountAwareness
Pass thresholds:
Correctness checks:
- completed_steps: must match expected completion for the scenario
- remaining_steps: only steps genuinely blocked by the scenario
- persona: must match user's selection

| scenario pattern | expected completed | acceptable remaining |
|---|---|---|
| no accounts connected | persona, welcome, identity, profile, preferences | github_link, linkedin_connect, linkedin_export |
| github connected | above + github_link | linkedin_connect, linkedin_export |
| linkedin connected | above + linkedin_connect | github_link, linkedin_export |
| both connected | above + github_link, linkedin_connect | linkedin_export |
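
The table above can be turned into a mechanical check. The following is a minimal sketch, not the repo's actual helper: the `StepExpectation` type, the `EXPECTATIONS` map, and the `checkSteps` function are hypothetical names introduced for illustration, and the scenario-pattern keys simply reuse the table's row labels. It treats "must match" as an exact match, which is an assumption.

```typescript
// Illustrative sketch: names and shapes here are assumptions, not the repo's helpers.
type Step = string;

interface StepExpectation {
  expectedCompleted: Step[];
  acceptableRemaining: Step[];
}

const BASE_STEPS: Step[] = ["persona", "welcome", "identity", "profile", "preferences"];

// One entry per row of the table above.
const EXPECTATIONS: Record<string, StepExpectation> = {
  "no accounts connected": {
    expectedCompleted: BASE_STEPS.slice(),
    acceptableRemaining: ["github_link", "linkedin_connect", "linkedin_export"],
  },
  "github connected": {
    expectedCompleted: BASE_STEPS.concat(["github_link"]),
    acceptableRemaining: ["linkedin_connect", "linkedin_export"],
  },
  "linkedin connected": {
    expectedCompleted: BASE_STEPS.concat(["linkedin_connect"]),
    acceptableRemaining: ["github_link", "linkedin_export"],
  },
  "both connected": {
    expectedCompleted: BASE_STEPS.concat(["github_link", "linkedin_connect"]),
    acceptableRemaining: ["linkedin_export"],
  },
};

// Returns human-readable problems; an empty list means the check passed.
function checkSteps(pattern: string, completed: Step[], remaining: Step[]): string[] {
  const expectation = EXPECTATIONS[pattern];
  if (expectation === undefined) {
    return [`unknown scenario pattern: ${pattern}`];
  }
  const problems: string[] = [];
  for (const step of expectation.expectedCompleted) {
    if (completed.indexOf(step) === -1) problems.push(`missing completed step: ${step}`);
  }
  for (const step of completed) {
    if (expectation.expectedCompleted.indexOf(step) === -1) {
      problems.push(`unexpected completed step: ${step}`);
    }
  }
  for (const step of remaining) {
    if (expectation.acceptableRemaining.indexOf(step) === -1) {
      problems.push(`unexpected remaining step: ${step}`);
    }
  }
  return problems;
}
```

A check like this only reports on step bookkeeping; it does not replace the judge and should never be used to adjust how the judge scores.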
Primary: overallScore (0-10)
Criteria: Per-mission success criteria, each scored pass/fail with reasoning
Pass thresholds:
Correctness checks:
- criteriaResults: each criterion has pass/fail + reasoning
- completedSteps: at least the expected steps for the mission's starting state
- finalPersona: a persona was selected
- qualitativeNotes: strengths/weaknesses/unexpected behaviors logged

Keep when:
Discard when:
- completed_steps changed unexpectedly

LLM evals are non-deterministic. Agent live evals have higher variance than onboarding evals because both the user messages and agent responses are LLM-generated.
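
Because scores vary from run to run, a single before/after pair can mislead. One way to account for that, sketched below, is to repeat each run a few times and compare the change in mean score against the spread of the runs. The function names and the "delta must exceed the larger standard deviation" rule are illustrative assumptions, not thresholds defined by this lab.

```typescript
// Illustrative sketch: a simple noise check over repeated eval runs.
// The decision rule is an assumed heuristic, not the lab's keep/discard criteria.

function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("need at least one score");
  let total = 0;
  for (const s of scores) total += s;
  return total / scores.length;
}

// Sample standard deviation (n - 1 denominator); 0 when fewer than two runs.
function sampleStdDev(scores: number[]): number {
  if (scores.length < 2) return 0;
  const m = mean(scores);
  let sumSquares = 0;
  for (const s of scores) sumSquares += (s - m) * (s - m);
  return Math.sqrt(sumSquares / (scores.length - 1));
}

// True when the mean improvement is larger than the noisier of the two samples.
function exceedsNoise(before: number[], after: number[]): boolean {
  const delta = mean(after) - mean(before);
  const noise = Math.max(sampleStdDev(before), sampleStdDev(after));
  return delta > noise;
}
```

For example, `exceedsNoise([6, 7, 6], [8, 9, 8])` compares a mean gain of 2 points against a spread of well under 1 point, so the gain would count as more than noise under this heuristic.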
After each iteration, log an entry in the matching format. For onboarding evals:
## Scenario
<which eval scenario was targeted>
## Judge Issues
<verbatim issues from the CSV>
## Hypothesis
<one sentence>
## Change
<file edited and what changed>
## Benchmark
- before: judge_score=X | dim1=X, dim2=X, ...
- after: judge_score=X | dim1=X, dim2=X, ...
- delta: +/-X overall | dimension deltas
- completion: N/M steps -> N/M steps
## Decision
keep | discard
## Next Step
<run full suite | iterate on same scenario | move to next weakest>
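
The `delta` line in the entry above could be produced from two score snapshots. This is a minimal sketch under assumptions: the `OnboardingScores` shape and the `formatDelta` function are hypothetical, and the real CSV columns may be named or structured differently.

```typescript
// Illustrative sketch: formats the "- delta:" line of an onboarding log entry.
// OnboardingScores is an assumed shape, not the repo's actual CSV schema.
interface OnboardingScores {
  judge_score: number;
  dimensions: Record<string, number>;
}

// Rounds to two decimals and always shows a sign, e.g. "+1" or "-0.5".
function signed(value: number): string {
  const rounded = Math.round(value * 100) / 100;
  return rounded >= 0 ? `+${rounded}` : `${rounded}`;
}

function formatDelta(before: OnboardingScores, after: OnboardingScores): string {
  const overall = signed(after.judge_score - before.judge_score);
  const parts: string[] = [];
  for (const name of Object.keys(before.dimensions)) {
    const b = before.dimensions[name];
    const a = after.dimensions[name];
    // A dimension missing from either snapshot is reported rather than guessed.
    if (a === undefined || b === undefined) {
      parts.push(`${name}=n/a`);
    } else {
      parts.push(`${name}=${signed(a - b)}`);
    }
  }
  return `- delta: ${overall} overall | ${parts.join(", ")}`;
}
```

Reporting per-dimension deltas alongside the overall delta makes it easier to spot a change that raises one dimension while lowering another.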
For agent live evals:
## Mission
<mission id>
## Failed Criteria
<list of criteria that failed with judge reasoning>
## Hypothesis
<one sentence>
## Change
<file edited and what changed>
## Benchmark
- before: overallScore=X | N/M criteria passed | steps: [list]
- after: overallScore=X | N/M criteria passed | steps: [list]
- weaknesses before: ...
- weaknesses after: ...
## Decision
keep | discard
## Next Step
<run full suite | iterate on same mission | add new mission>
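
The "N/M criteria passed" figure and the failed-criteria list in the entry above can be derived from the judge's per-criterion results. The sketch below assumes a `CriterionResult` shape with `criterion`, `passed`, and `reasoning` fields; those field names and the `summarizeCriteria` function are hypothetical and may not match what `judge.ts` actually returns.

```typescript
// Illustrative sketch: summarizes per-criterion judge results for a log entry.
// CriterionResult is an assumed shape, not the actual output type of the judge.
interface CriterionResult {
  criterion: string;
  passed: boolean;
  reasoning: string;
}

interface CriteriaSummary {
  summary: string;  // e.g. "2/3 criteria passed"
  failed: string[]; // one "criterion: reasoning" line per failed criterion
}

function summarizeCriteria(results: CriterionResult[]): CriteriaSummary {
  const failed: string[] = [];
  let passedCount = 0;
  for (const result of results) {
    if (result.passed) {
      passedCount += 1;
    } else {
      // Keep the judge's reasoning verbatim so the hypothesis targets the real issue.
      failed.push(`${result.criterion}: ${result.reasoning}`);
    }
  }
  return {
    summary: `${passedCount}/${results.length} criteria passed`,
    failed,
  };
}
```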
npx claudepluginhub noticedso/noticed-labs --plugin noticed-labs