Skill

adk-evals

From adk

Writes and runs automated conversation tests (evals) for ADK agents. Covers file format, assertions, CLI usage, and per-primitive testing patterns.

testing

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/adk:adk-evals

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, which workflows run, and more.

Supporting Files

references/eval-format.md
references/test-patterns.md
references/testing-workflow.md

SKILL.md

245 lines · ~1.8k tokens

Stats

Language: Shell

Stars: 12

Forks: 1

Maintenance: Excellent

Last Commit: May 28, 2026

ADK Evals Skill

What are Evals?

Evals are automated conversation tests for ADK agents. They run against a live dev bot (adk dev), so they exercise the full stack rather than mocks.

When to Use This Skill

Use this skill when the developer asks about:

Writing evals — file format, assertions, turn types, setup
Running evals — CLI commands, filtering, output interpretation
Testing specific primitives — how to test actions, tools, workflows, conversations, state
The testing loop — write → run → inspect traces → iterate
CI integration — exit codes, --format json flag, tagging strategies
Eval configuration — idleTimeout, judgePassThreshold, judgeModel

Also use it when you are developing an ADK bot and need the equivalent of unit or end-to-end tests.

Trigger questions:

"How do I write an eval?"
"How do I test my workflow?"
"How do I assert that a tool was called with specific params?"
"My eval is failing, how do I debug it?"
"How do I test that the bot stays silent?"
"How do I run evals in CI?"
"How do I seed state before an eval?"
"How do I trigger a workflow in an eval?"

Available Documentation

File	Contents
`references/eval-format.md`	Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options
`references/testing-workflow.md`	Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration
`references/test-patterns.md`	Per-primitive patterns for actions, tools, workflows, conversations, and state

How to Answer

Writing an eval → Read eval-format.md for structure and assertions
Running evals → Read testing-workflow.md for CLI commands and output
Testing a specific primitive → Read test-patterns.md for the relevant section
Debugging a failure → Combine testing-workflow.md (inspect traces) + eval-format.md (check assertion syntax)

Quick Reference

Eval file structure

import { Eval } from '@botpress/evals'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],

  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },

  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],

  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },

  options: {
    idleTimeout: 60000,
    judgePassThreshold: 4,
  },
})

Turn types

Turn	When to use
`user: 'message'`	Standard user message
`event: { type, payload }`	Non-message trigger (webhook, integration event)
`expectSilence: true`	Assert bot does NOT respond
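
The three turn types in the table can be combined in one conversation. The sketch below is illustrative only: the `Turn` type is a local, hypothetical stand-in modeled on the shapes in the table (not the library's real type definitions), and the `payment.retried` event type and `invoiceId` payload field are made-up examples.

```typescript
// Hypothetical local type modeled on the turn shapes in the table above.
// It is NOT the library's real type definition.
type Turn =
  | { user: string; expectSilence?: boolean }
  | { event: { type: string; payload?: Record<string, unknown> }; expectSilence?: boolean }

// One conversation that uses each turn type once.
const conversation: Turn[] = [
  // Standard user message.
  { user: 'My payment did not go through' },
  // Non-message trigger; the payload field is a made-up example.
  { event: { type: 'payment.failed', payload: { invoiceId: 'inv-1' } } },
  // expectSilence is attached to a user or event turn, never used by itself.
  { event: { type: 'payment.retried' }, expectSilence: true },
]

// Label each turn by the kind of trigger it carries.
function turnKind(turn: Turn): 'user' | 'event' {
  return 'user' in turn ? 'user' : 'event'
}

console.log(conversation.map(turnKind).join(', ')) // user, event, event
```

In a real eval file this array would be the `conversation` field passed to `new Eval({...})`.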

Assertion categories

Category	What it checks
`response`	Bot reply text (contains, not_contains, matches, llm_judge)
`tools`	Tool calls (called, not_called, call_order, params)
`state`	Bot/user/conversation state (equals, changed)
`workflow`	Workflow execution (entered, completed)
`timing`	Response time in ms (lte, gte)
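
As a rough illustration, a single turn can assert on several categories at once. This sketch uses only the assertion shapes that appear elsewhere in this document (`not_contains`, `llm_judge`, `called` with `params`, and `path` with `equals`). The `TurnAssert` type is a local, hypothetical stand-in, and the `conversation.ticketOpen` state path is invented for the example. The `timing`, `call_order`, `changed`, and `entered` operators are left out because their exact shape is not shown here; see eval-format.md for those.

```typescript
// Hypothetical local types inferred from the examples in this document.
// They are NOT the library's real type definitions.
type Match = { equals: unknown }
type TurnAssert = {
  response?: Array<{ not_contains: string } | { llm_judge: string }>
  tools?: Array<{ called: string; params?: Record<string, Match> } | { not_called: string }>
  state?: Array<{ path: string; equals: unknown }>
}

const ticketTurn: { user: string; assert: TurnAssert } = {
  user: 'My checkout is broken and I need this fixed today',
  assert: {
    // What the bot says.
    response: [
      { not_contains: 'error' },
      { llm_judge: 'Response confirms that a ticket was opened' },
    ],
    // Which tool it calls, and with which extracted params.
    tools: [{ called: 'createTicket', params: { priority: { equals: 'high' } } }],
    // How state changes (the path is a made-up example).
    state: [{ path: 'conversation.ticketOpen', equals: true }],
  },
}

// List the categories this turn asserts on.
console.log(Object.keys(ticketTurn.assert).join(', ')) // response, tools, state
```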

CLI commands

adk evals                        # run all evals
adk evals <name>                 # run one eval
adk evals --tag <tag>            # filter by tag
adk evals --type regression      # filter by type
adk evals --verbose              # show all assertions
adk evals --format json          # JSON output for CI

adk evals runs                   # list recent runs
adk evals runs --latest          # most recent run
adk evals runs --latest -v       # with full details
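
For scripted runs (for example, a CI step), the flags above can be composed programmatically. The helper below is a hypothetical sketch, not part of the ADK: `buildEvalArgs` and `EvalRunOptions` are invented names, and it assumes the listed flags can be combined in one invocation, which this document does not confirm.

```typescript
// Hypothetical helper: composes arguments for the `adk evals` commands listed above.
// Only flags shown in this document are used; combining them is an assumption.
type EvalRunOptions = {
  name?: string      // run a single eval by name
  tag?: string       // --tag <tag>
  type?: string      // --type <type>, e.g. 'regression'
  verbose?: boolean  // --verbose
  json?: boolean     // --format json
}

function buildEvalArgs(options: EvalRunOptions = {}): string[] {
  const args = ['evals']
  if (options.name) args.push(options.name)
  if (options.tag) args.push('--tag', options.tag)
  if (options.type) args.push('--type', options.type)
  if (options.verbose) args.push('--verbose')
  if (options.json) args.push('--format', 'json')
  return args
}

const command = ['adk'].concat(buildEvalArgs({ type: 'regression', json: true })).join(' ')
console.log(command) // adk evals --type regression --format json
```

The resulting argument list can be handed to whatever process runner the CI script already uses; the process exit code is then what the CI job should act on.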

Critical Patterns

✅ Every turn needs user or event

// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }

❌ expectSilence alone is not a valid turn

// WRONG — missing user or event
{ expectSilence: true }
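
The rule above is easy to check mechanically before running anything. The following is a minimal sketch of such a check, not something the ADK provides: `DraftTurn` and `findTurnsWithoutTrigger` are hypothetical names, and the type is a loose local stand-in for a turn.

```typescript
// Hypothetical, deliberately loose turn type so that invalid drafts can be represented.
type DraftTurn = {
  user?: string
  event?: { type: string; payload?: Record<string, unknown> }
  expectSilence?: boolean
}

// Returns the indexes of turns that have neither `user` nor `event`.
function findTurnsWithoutTrigger(turns: DraftTurn[]): number[] {
  const invalid: number[] = []
  turns.forEach((turn, index) => {
    if (turn.user === undefined && turn.event === undefined) invalid.push(index)
  })
  return invalid
}

const draft: DraftTurn[] = [
  { user: 'hello', expectSilence: true },                      // valid
  { event: { type: 'payment.failed' }, expectSilence: true },  // valid
  { expectSilence: true },                                     // invalid: no trigger
]

console.log(findTurnsWithoutTrigger(draft).join(', ')) // 2
```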

✅ Assert tool params to verify correct extraction

// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }

❌ Only asserting the tool was called

// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }

✅ Use outcome for post-conversation state and workflow assertions

// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  workflow: [{ name: 'ticketFlow', completed: true }],
}
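
To show how per-turn assertions and `outcome` divide the work, here is a sketch of a two-turn eval definition written as a plain object. The eval name, tag, and user messages are made up; the tool, state path, and workflow name are the ones used in the snippets above. In a real eval file the object would be passed to `new Eval({...})` as in the file structure example.

```typescript
// Plain-object sketch of an eval definition; names and messages are illustrative.
const ticketResolution = {
  name: 'ticket-resolution',
  type: 'regression',
  tags: ['tickets'],

  conversation: [
    {
      user: 'My invoice is wrong and it is urgent, please open a ticket',
      // Per-turn assertion: checked right after this turn.
      assert: {
        tools: [{ called: 'createTicket', params: { priority: { equals: 'high' } } }],
      },
    },
    {
      user: 'Thanks, that is all',
      assert: { response: [{ not_contains: 'error' }] },
    },
  ],

  // Outcome: checked once, after all turns have run.
  outcome: {
    state: [{ path: 'conversation.resolved', equals: true }],
    workflow: [{ name: 'ticketFlow', completed: true }],
  },
}

console.log(ticketResolution.conversation.length) // 2
```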

✅ Seed state to test conditional behavior without running setup turns

// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}

❌ Using conversation turns to set up state (slow and fragile)

// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },      // hoping bot sets user.plan
  { user: 'I need help with billing' },   // actual test turn
]
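
For comparison, the same scenario can be written with seeded state and a single test turn. This is a sketch under assumptions: the eval name, tag, and `llm_judge` wording are invented for the example, and the object is shown as a plain literal that a real eval file would pass to `new Eval({...})`.

```typescript
// Plain-object sketch: seeded state replaces the fragile setup turn.
const billingHelpForProUser = {
  name: 'billing-help-pro',
  type: 'regression',
  tags: ['billing'],

  // Start in a known state instead of hoping the bot sets user.plan.
  setup: {
    state: {
      user: { plan: 'pro' },
      conversation: { phase: 'support' },
    },
  },

  // Only the turn under test remains.
  conversation: [
    {
      user: 'I need help with billing',
      assert: {
        response: [{ llm_judge: 'Response addresses billing for a pro-plan user' }],
      },
    },
  ],
}

console.log(billingHelpForProUser.conversation.length) // 1
```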

Example Questions

Writing evals:

"Write an eval that tests my createTicket tool is called with the right priority"
"How do I assert that the bot stays silent after an internal event?"
"How do I test a multi-turn conversation where context is retained?"

Running evals:

"How do I run only regression evals?"
"How do I see which assertions failed and why?"
"How do I integrate evals into GitHub Actions?"

Debugging:

"My eval says the tool wasn't called but I think it was — how do I check?"
"How do I inspect what the bot actually did during an eval?"

Per-primitive:

"How do I test a workflow that uses step.sleep()?"
"How do I test that state changed from the seeded value?"

Response Format

Match depth to the question.

Simple questions ("what assertions are available?", "how do I run evals?")

Answer directly — show the relevant table or CLI command. Don't generate a full eval file for an informational question.

Writing an eval

Show the complete new Eval({}) call with realistic field values
Include imports (import { Eval } from '@botpress/evals')
Briefly explain non-obvious assertions — skip if the assertion is self-explanatory
Suggest the CLI command to run it: adk evals <name>

Debugging a failing eval

Ask for or show the failing assertion (expected / actual diff)
Suggest opening traces in the Dev Console to see what the bot did
Identify whether the issue is in the eval assertion or the bot's behavior
