Writes and runs automated conversation tests (evals) for ADK agents. Covers file format, assertions, CLI usage, and per-primitive testing patterns.
Slash command: `/adk:adk-evals`
Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, which workflows run, and more.
Evals run against a live dev bot (adk dev), so they test the full stack — not mocks.
Use this skill when the developer asks about:
- `--format json` flag, tagging strategies

Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.
Trigger questions:
| File | Contents |
|---|---|
| `references/eval-format.md` | Complete file format: all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `references/testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `references/test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, and state |
- `eval-format.md` for structure and assertions
- `testing-workflow.md` for CLI commands and output
- `test-patterns.md` for the relevant section
- `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)

```typescript
import { Eval } from '@botpress/evals'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],
  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },
  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],
  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },
  options: {
    idleTimeout: 60000,
    judgePassThreshold: 4,
  },
})
```
| Turn | When to use |
|---|---|
| `user: 'message'` | Standard user message |
| `event: { type, payload }` | Non-message trigger (webhook, integration event) |
| `expectSilence: true` | Assert bot does NOT respond |
| Category | What it checks |
|---|---|
| `response` | Bot reply text (`contains`, `not_contains`, `matches`, `llm_judge`) |
| `tools` | Tool calls (`called`, `not_called`, `call_order`, `params`) |
| `state` | Bot/user/conversation state (`equals`, `changed`) |
| `workflow` | Workflow execution (`entered`, `completed`) |
| `timing` | Response time in ms (`lte`, `gte`) |
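
To make the deterministic `response` operators concrete, here is a minimal sketch of how they could be evaluated against a reply. The types and the `checkResponse` helper are hypothetical and not part of `@botpress/evals`. It assumes case-sensitive substring matching and treats `matches` as a regular expression, which the real runner may not do. `llm_judge` is left out because it needs a model call.

```typescript
// Hypothetical local types; the real runner's matching rules may differ.
type ResponseAssertion =
  | { contains: string }
  | { not_contains: string }
  | { matches: string } // ASSUMPTION: a regular expression source string

// ASSUMPTION: case-sensitive substring checks and an unanchored regex test.
function checkResponse(reply: string, assertion: ResponseAssertion): boolean {
  if ('contains' in assertion) return reply.includes(assertion.contains)
  if ('not_contains' in assertion) return !reply.includes(assertion.not_contains)
  return new RegExp(assertion.matches).test(reply)
}

const sampleReply = 'Hi there! How can I help you today?'

// Each assertion in a response list is checked independently.
const sampleResults = [
  checkResponse(sampleReply, { not_contains: 'error' }),
  checkResponse(sampleReply, { contains: 'help' }),
  checkResponse(sampleReply, { matches: '^Hi\\b' }),
]
console.log(sampleResults)
```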
```bash
adk evals                     # run all evals
adk evals <name>              # run one eval
adk evals --tag <tag>         # filter by tag
adk evals --type regression   # filter by type
adk evals --verbose           # show all assertions
adk evals --format json       # JSON output for CI
adk evals runs                # list recent runs
adk evals runs --latest       # most recent run
adk evals runs --latest -v    # with full details
```
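
For CI scripts it can help to assemble the argument list in code. The sketch below uses only the flags listed above. `EvalRunOptions` and `buildEvalArgs` are hypothetical names, and it assumes the flags can be combined in one invocation, which the source does not confirm.

```typescript
// Hypothetical helper; it only emits flags that appear in the command list above.
type EvalRunOptions = {
  name?: string // run one eval by name
  tag?: string // --tag <tag>
  type?: string // --type <type>, e.g. 'regression'
  verbose?: boolean // --verbose
  json?: boolean // --format json
}

// ASSUMPTION: the filters and output flags may be combined in one run.
function buildEvalArgs(opts: EvalRunOptions): string[] {
  const args = ['evals']
  if (opts.name) args.push(opts.name)
  if (opts.tag) args.push('--tag', opts.tag)
  if (opts.type) args.push('--type', opts.type)
  if (opts.verbose) args.push('--verbose')
  if (opts.json) args.push('--format', 'json')
  return args
}

console.log(['adk', ...buildEvalArgs({ tag: 'basic', json: true })].join(' '))
// prints "adk evals --tag basic --format json"
```

The returned array can be passed to a process launcher such as Node's `child_process.spawnSync('adk', args)`.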
✅ Every turn needs `user` or `event`

```typescript
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

❌ `expectSilence` alone is not a valid turn

```typescript
// WRONG: missing user or event
{ expectSilence: true }
```
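
The rule is mechanical enough to lint before running anything. This is a minimal sketch using a hypothetical local `Turn` type and `validateTurn` helper. Neither comes from `@botpress/evals`; they only mirror the turn fields shown above.

```typescript
// Hypothetical local shape mirroring the turn fields shown above.
// It is NOT imported from '@botpress/evals'.
type Turn = {
  user?: string
  event?: { type: string; payload?: unknown }
  expectSilence?: boolean
}

// Returns a list of problems; an empty list means the turn has a trigger.
function validateTurn(turn: Turn): string[] {
  const problems: string[] = []
  const hasUser = typeof turn.user === 'string'
  const hasEvent = turn.event !== undefined
  if (!hasUser && !hasEvent) {
    problems.push('turn needs a user message or an event')
  }
  return problems
}

const exampleTurns: Turn[] = [
  { user: 'hello', expectSilence: true },
  { event: { type: 'payment.failed' }, expectSilence: true },
  { expectSilence: true }, // invalid: nothing triggers the bot
]

for (const turn of exampleTurns) {
  console.log(JSON.stringify(turn), validateTurn(turn))
}
```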
✅ Assert tool `params` to verify correct extraction

```typescript
// CORRECT: verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

❌ Only asserting the tool was called

```typescript
// INCOMPLETE: doesn't verify params were correct
{ called: 'createTicket' }
```
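
A small sketch shows why the name-only assertion is weaker. The `ToolCall` shape and `toolAssertionPasses` helper are hypothetical and only support `equals` with strict equality. The real runner records and compares tool calls in its own way.

```typescript
// Hypothetical shapes for illustration; not the library's types.
type ToolCall = { name: string; input: Record<string, unknown> }
type ParamMatcher = { equals: unknown }
type CalledAssertion = { called: string; params?: Record<string, ParamMatcher> }

// Passes if some recorded call has the right name and every listed param matches.
// ASSUMPTION: `equals` means strict equality on primitive values.
function toolAssertionPasses(calls: ToolCall[], assertion: CalledAssertion): boolean {
  return calls.some((call) => {
    if (call.name !== assertion.called) return false
    const params = assertion.params ?? {}
    return Object.entries(params).every(([key, matcher]) => call.input[key] === matcher.equals)
  })
}

// The bot called the right tool but extracted the wrong priority.
const recordedCalls: ToolCall[] = [
  { name: 'createTicket', input: { priority: 'low', subject: 'Refund' } },
]

const nameOnly = toolAssertionPasses(recordedCalls, { called: 'createTicket' })
const withParams = toolAssertionPasses(recordedCalls, {
  called: 'createTicket',
  params: { priority: { equals: 'high' } },
})
console.log({ nameOnly, withParams })
```

Here the name-only check passes even though the extracted priority is wrong, while the `params` check fails as it should.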
✅ Use `outcome` for post-conversation state and workflow assertions

```typescript
// CORRECT: final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  workflow: [{ name: 'ticketFlow', completed: true }],
}
```
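
State assertions address values by dotted `path`. The sketch below shows one way such a path could be resolved against a final state object. `getByPath` and `stateAssertionPasses` are hypothetical helpers. It assumes the first path segment is the scope (`bot`, `user`, or `conversation`) and that `equals` is strict equality.

```typescript
// Hypothetical shape mirroring the state assertions shown above.
type StateAssertion = { path: string; equals: unknown }

// Walks a dotted path; returns undefined if any segment is missing.
function getByPath(root: unknown, path: string): unknown {
  let current: unknown = root
  for (const key of path.split('.')) {
    if (typeof current !== 'object' || current === null) return undefined
    current = (current as Record<string, unknown>)[key]
  }
  return current
}

// ASSUMPTION: `equals` compares with strict equality.
function stateAssertionPasses(state: unknown, assertion: StateAssertion): boolean {
  return getByPath(state, assertion.path) === assertion.equals
}

// Illustrative final state, keyed by scope.
const finalState = {
  user: { plan: 'pro' },
  conversation: { resolved: true },
}

console.log(stateAssertionPasses(finalState, { path: 'conversation.resolved', equals: true }))
console.log(stateAssertionPasses(finalState, { path: 'conversation.greeted', equals: true }))
```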
✅ Seed state to test conditional behavior without running setup turns

```typescript
// CORRECT: start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

❌ Using conversation turns to set up state (slow and fragile)

```typescript
// WRONG: depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },     // hoping bot sets user.plan
  { user: 'I need help with billing' }, // actual test turn
]
```
Writing evals:
Running evals:
Debugging:
Per-primitive:
Match depth to the question.
Answer directly — show the relevant table or CLI command. Don't generate a full eval file for an informational question.
- `new Eval({})` call with realistic field values
- `import { Eval } from '@botpress/evals'`
- `adk evals <name>`
- expected / actual diff