Use this skill to generate ADK 2.0 .evalset.json files — test cases for the `adk eval` CLI. Triggers on: "ADK eval set", "ADK evalset.json", "create test cases for ADK agent", "ADK eval file", "agent test fixtures", "adk eval test data", "evaluation dataset ADK". Generates a properly structured eval set with expected tool calls, expected outputs, and metadata ready for `adk eval`.
Slash command: `/adk-evaluation:eval-set-generator`
Generate `.evalset.json` files for `adk eval`. Each entry is a test case the agent must handle correctly.
```json
{
  "eval_set_id": "weather_assistant_basic_001",
  "name": "Weather Assistant — basic queries",
  "description": "Smoke tests for single-turn weather queries.",
  "eval_cases": [
    {
      "eval_id": "weather_tokyo",
      "conversation": [
        {
          "user_content": { "parts": [{ "text": "What's the weather in Tokyo?" }] },
          "final_response": { "parts": [{ "text": "..." }] },
          "intermediate_data": {
            "tool_uses": [
              { "name": "get_weather", "args": { "city": "Tokyo" } }
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "weather_app",
        "user_id": "test_user",
        "state": {}
      }
    }
  ]
}
```
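
To make the nesting concrete, here is a minimal sketch that walks an eval set shaped like the example above and lists the tool calls each case expects. The helper name `expected_tools` is hypothetical and not part of ADK; it assumes only the field names shown in the example.

```python
from typing import Any


def expected_tools(eval_set: dict[str, Any]) -> dict[str, list[str]]:
    """Map each eval_id to the tool names expected across all of its turns."""
    result: dict[str, list[str]] = {}
    for case in eval_set.get("eval_cases", []):
        names: list[str] = []
        for turn in case.get("conversation", []):
            # intermediate_data may be missing on turns that expect no tool call.
            tool_uses = (turn.get("intermediate_data") or {}).get("tool_uses", [])
            names.extend(tool["name"] for tool in tool_uses)
        result[case["eval_id"]] = names
    return result


# A trimmed copy of the example eval set (illustrative data only).
sample = {
    "eval_set_id": "weather_assistant_basic_001",
    "eval_cases": [
        {
            "eval_id": "weather_tokyo",
            "conversation": [
                {
                    "user_content": {"parts": [{"text": "What's the weather in Tokyo?"}]},
                    "intermediate_data": {
                        "tool_uses": [{"name": "get_weather", "args": {"city": "Tokyo"}}]
                    },
                }
            ],
        }
    ],
}

print(expected_tools(sample))  # {'weather_tokyo': ['get_weather']}
```

Each entry in `conversation` is one turn, so a multi-turn case simply contributes several tool names to the same `eval_id`.
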
For every agent, generate these case types:
| Type | Example | Why |
|---|---|---|
| Happy path | Standard expected query | Baseline coverage |
| Edge — empty/null | "What's the weather?" (no city) | Missing param handling |
| Edge — invalid | "Weather in Atlantis" | Graceful unknown handling |
| Multi-turn | Follow-up referencing prior turn | Context retention |
| Tool-required | Query that MUST trigger a tool | Tool routing |
| Tool-forbidden | Query that should NOT trigger a tool | Avoid over-tool-use |
| Adversarial | Prompt injection attempts | Safety |
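
A few rows of the table can be sketched as data. The helpers `make_case` and `weather`, the `get_weather` tool, and the sample prompts are illustrative assumptions; the field names follow the example eval set. Representing "no tool expected" as an empty `tool_uses` list is also an assumption, so check it against the metric you use.

```python
from typing import Any


def make_case(eval_id: str, turns: list[tuple[str, list[dict[str, Any]]]]) -> dict[str, Any]:
    """Build one eval case from (user_text, expected_tool_uses) pairs, one pair per turn."""
    return {
        "eval_id": eval_id,
        "conversation": [
            {
                "user_content": {"parts": [{"text": text}]},
                "intermediate_data": {"tool_uses": tools},
            }
            for text, tools in turns
        ],
        "session_input": {"app_name": "weather_app", "user_id": "test_user", "state": {}},
    }


def weather(city: str) -> dict[str, Any]:
    """Expected call to the (hypothetical) get_weather tool."""
    return {"name": "get_weather", "args": {"city": city}}


cases = [
    # Tool-required: the query must route to get_weather.
    make_case("tool_required_tokyo", [("What's the weather in Tokyo?", [weather("Tokyo")])]),
    # Tool-forbidden: a greeting should not trigger any tool.
    make_case("tool_forbidden_greeting", [("Hi, what can you do?", [])]),
    # Edge, missing city: assumed here that the agent asks for a city instead of calling the tool.
    make_case("edge_missing_city", [("What's the weather?", [])]),
    # Multi-turn: the follow-up relies on context from the first turn.
    make_case("multi_turn_followup", [
        ("What's the weather in Tokyo?", [weather("Tokyo")]),
        ("And in Paris?", [weather("Paris")]),
    ]),
]

print([case["eval_id"] for case in cases])
```

Keeping one `eval_id` prefix per case type makes it easier to see which category failed when reading results.
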
```python
# scripts/gen_evalset.py
import json
from pathlib import Path

# Build one single-turn case per city; each expects a single get_weather call.
cases = []
for city in ["Tokyo", "Paris", "Atlantis"]:
    cases.append({
        "eval_id": f"weather_{city.lower()}",
        "conversation": [{
            "user_content": {"parts": [{"text": f"Weather in {city}?"}]},
            "intermediate_data": {
                "tool_uses": [{"name": "get_weather", "args": {"city": city}}]
            },
        }],
        "session_input": {"app_name": "weather", "user_id": "t1", "state": {}},
    })

eval_set = {
    "eval_set_id": "weather_basic_001",
    "name": "Weather basic queries",
    "eval_cases": cases,
}

# Write the eval set to the current working directory.
Path("weather_basic.evalset.json").write_text(json.dumps(eval_set, indent=2))
```
Run the eval set with `adk eval`:
```shell
adk eval ./agent.py ./weather_basic.evalset.json
adk eval ./agent.py ./weather_basic.evalset.json --num_runs 3
```
Checklist:
- `eval_id` is unique within a set
- `intermediate_data.tool_uses` matches actual tool names exactly
- `session_input.state` reflects realistic starting state

Related skills:
- `custom-metric-builder` for scoring beyond default similarity
- `user-simulation-runner` for multi-turn dynamic evals
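
The first two checklist items (unique `eval_id`, exact tool names) can be checked mechanically before running anything. This is a minimal sketch: `check_eval_set` and the `known_tools` set are hypothetical, and in practice the tool names would come from your agent's actual tool definitions.

```python
from collections import Counter
from typing import Any


def check_eval_set(eval_set: dict[str, Any], known_tools: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems: list[str] = []
    cases = eval_set.get("eval_cases", [])

    # Check 1: eval_id is unique within the set.
    counts = Counter(case["eval_id"] for case in cases)
    for eval_id, n in counts.items():
        if n > 1:
            problems.append(f"duplicate eval_id: {eval_id} ({n} times)")

    # Check 2: every expected tool name exactly matches a known tool.
    for case in cases:
        for turn in case.get("conversation", []):
            tool_uses = (turn.get("intermediate_data") or {}).get("tool_uses", [])
            for tool in tool_uses:
                if tool["name"] not in known_tools:
                    problems.append(f"{case['eval_id']}: unknown tool {tool['name']!r}")
    return problems


# Deliberately broken input: a repeated eval_id and a misspelled tool name.
bad = {
    "eval_cases": [
        {"eval_id": "a", "conversation": [
            {"intermediate_data": {"tool_uses": [{"name": "get_weather", "args": {}}]}}
        ]},
        {"eval_id": "a", "conversation": [
            {"intermediate_data": {"tool_uses": [{"name": "getWeather", "args": {}}]}}
        ]},
    ]
}

for problem in check_eval_set(bad, known_tools={"get_weather"}):
    print(problem)
```

The third item, whether `session_input.state` is realistic, depends on the agent and still needs a human look.
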
npx claudepluginhub healthcare-ai-consulting-llc/adk-2-toolkit --plugin adk-evaluation