eval-set-generator | adk-evaluation

Stats

Actions

Tags

eval-set-generator | adk-evaluation

eval-set-generator

Generate .evalset.json files for adk eval. Each entry is a test case the agent must handle correctly.

File structure

{
  "eval_set_id": "weather_assistant_basic_001",
  "name": "Weather Assistant — basic queries",
  "description": "Smoke tests for single-turn weather queries.",
  "eval_cases": [
    {
      "eval_id": "weather_tokyo",
      "conversation": [
        {
          "user_content": { "parts": [{ "text": "What's the weather in Tokyo?" }] },
          "final_response": { "parts": [{ "text": "..." }] },
          "intermediate_data": {
            "tool_uses": [
              { "name": "get_weather", "args": { "city": "Tokyo" } }
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "weather_app",
        "user_id": "test_user",
        "state": {}
      }
    }
  ]
}

Cases to include

For every agent, generate these case types:

Type	Example	Why
Happy path	Standard expected query	Baseline coverage
Edge — empty/null	"What's the weather?" (no city)	Missing param handling
Edge — invalid	"Weather in Atlantis"	Graceful unknown handling
Multi-turn	Follow-up referencing prior turn	Context retention
Tool-required	Query that MUST trigger a tool	Tool routing
Tool-forbidden	Query that should NOT trigger a tool	Avoid over-tool-use
Adversarial	Prompt injection attempts	Safety

Generator script

# scripts/gen_evalset.py
import json
from pathlib import Path

cases = []
for city in ["Tokyo", "Paris", "Atlantis"]:
    cases.append({
        "eval_id": f"weather_{city.lower()}",
        "conversation": [{
            "user_content": {"parts": [{"text": f"Weather in {city}?"}]},
            "intermediate_data": {
                "tool_uses": [{"name": "get_weather", "args": {"city": city}}]
            },
        }],
        "session_input": {"app_name": "weather", "user_id": "t1", "state": {}},
    })

eval_set = {
    "eval_set_id": "weather_basic_001",
    "name": "Weather basic queries",
    "eval_cases": cases,
}

Path("weather_basic.evalset.json").write_text(json.dumps(eval_set, indent=2))

Run with `adk eval`

adk eval ./agent.py ./weather_basic.evalset.json
adk eval ./agent.py ./weather_basic.evalset.json --num_runs 3

Output

Per-case pass/fail
Tool-call match accuracy
Final-response similarity score
Latency stats

Validation

Each eval_id is unique within a set
intermediate_data.tool_uses matches actual tool names exactly
session_input.state reflects realistic starting state
Run once to confirm cases load before committing

See also

custom-metric-builder for scoring beyond default similarity
user-simulation-runner for multi-turn dynamic evals