user-simulation-runner | adk-evaluation

user-simulation-runner

Run synthetic users against your ADK 2.0 agent to test multi-turn behavior. An LLM plays each user with a goal, persona, and termination criterion.

When to use

Multi-turn flows (booking, troubleshooting) where a static .evalset.json file is too rigid
Stress testing with diverse personas
Discovering edge cases an authored test set misses
Validating robustness to realistic user phrasing

User simulator template

from google.adk.agents import LlmAgent

def make_user_simulator(persona: str, goal: str, max_turns: int = 10):
    return LlmAgent(
        name="simulated_user",
        model="gemini-2.5-flash",
        instruction=(
            f"You are roleplaying a user. Persona: {persona}\n"
            f"Goal: {goal}\n"
            f"Stop after {max_turns} turns or when your goal is met. "
            "Speak naturally — DO NOT mention you are an AI or that this is a simulation. "
            "Output ONLY what the user would type. When done, output exactly 'END'."
        ),
    )

Harness

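import asyncio
from google.adk.runners import InMemoryRunner
from google.genai import types

# Assumes the ADK 1.x runner API (InMemoryRunner, run_async, Event.is_final_response);
# verify these names against the ADK version you have installed.
async def ask(runner, session, text):
    """Send one message to an agent and return the text of its final response."""
    message = types.Content(role="user", parts=[types.Part(text=text)])
    reply = ""
    async for event in runner.run_async(
        user_id=session.user_id, session_id=session.id, new_message=message
    ):
        if event.is_final_response() and event.content and event.content.parts:
            reply = "".join(part.text or "" for part in event.content.parts)
    return reply

async def new_session(agent, app_name):
    """Give an agent its own runner and session so the two histories stay separate."""
    runner = InMemoryRunner(agent=agent, app_name=app_name)
    session = await runner.session_service.create_session(app_name=app_name, user_id="sim")
    return runner, session

async def simulate(system_agent, user_agent, max_turns=10):
    system = await new_session(system_agent, "system_under_test")
    user = await new_session(user_agent, "simulated_user")
    transcript = []
    user_msg = await ask(*user, "Start the conversation.")  # opener
    for _ in range(max_turns):
        if user_msg.strip() == "END":
            break
        transcript.append(("user", user_msg))
        sys_resp = await ask(*system, user_msg)
        transcript.append(("system", sys_resp))
        user_msg = await ask(*user, sys_resp)
    return transcript

# Run N personas
personas = [
    {"persona": "elderly first-time user", "goal": "book a doctor's appointment"},
    {"persona": "frustrated expert", "goal": "cancel a subscription"},
    {"persona": "non-native English speaker", "goal": "reset password"},
]

async def run_all(system_under_test):
    results = []
    for p in personas:
        user = make_user_simulator(**p)
        transcript = await simulate(system_under_test, user)
        results.append({"persona": p["persona"], "goal": p["goal"], "transcript": transcript})
    return results

# system_under_test is the ADK agent being evaluated; define or import it first.
results = asyncio.run(run_all(system_under_test))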
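
The turn loop can be checked offline before any model is involved. The sketch below is a minimal stand-in, not part of ADK: `simulate_with`, `scripted_user`, `stuck_user`, and `echo_system` are hypothetical names introduced here, and the agents are replaced by plain async callables so that the END check and the max-turn cap can be exercised deterministically.

```python
import asyncio

async def simulate_with(user_turn, system_turn, max_turns=10):
    """Same loop shape as the harness, but driven by plain async callables."""
    transcript = []
    user_msg = await user_turn(None)  # None asks the simulated user for an opener
    for _ in range(max_turns):
        if user_msg.strip() == "END":
            break
        transcript.append(("user", user_msg))
        sys_resp = await system_turn(user_msg)
        transcript.append(("system", sys_resp))
        user_msg = await user_turn(sys_resp)
    return transcript

def scripted_user(lines):
    """Hypothetical stub: replays fixed lines, then says END."""
    remaining = iter(lines)

    async def turn(_system_response):
        return next(remaining, "END")

    return turn

async def stuck_user(_system_response):
    """Hypothetical stub: never finishes, so only the turn cap can stop it."""
    return "I still need help."

async def echo_system(user_msg):
    """Hypothetical stub for the system under test."""
    return f"You said: {user_msg}"

# A user who finishes after two messages, and one who never finishes.
finished = asyncio.run(simulate_with(scripted_user(["hi", "book me in"]), echo_system))
capped = asyncio.run(simulate_with(stuck_user, echo_system, max_turns=3))
```

Here `finished` ends because the scripted user emits END, while `capped` ends only because the loop runs out of turns. Comparing the two lengths makes a cheap regression check for both stopping conditions.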

Goal-completion scoring

After each simulation, run a judge:

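from google import genai

def format_transcript(transcript):
    return "\n".join(f"{role}: {text}" for role, text in transcript)

async def score_goal_completion(transcript, goal):
    # Calls the judge through the google-genai client (API key read from the environment).
    # LiteLlm is an ADK model wrapper and has no complete() method.
    client = genai.Client()
    prompt = (
        f"Goal: {goal}\nTranscript:\n{format_transcript(transcript)}\n"
        'Was the goal achieved? Output only JSON: {"achieved": <true|false>, "reasoning": "<string>"}.'
    )
    response = await client.aio.models.generate_content(model="gemini-2.5-pro", contents=prompt)
    return response.text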
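
The judge replies in free text, so its verdict is worth validating before it is aggregated. The following is a small sketch under that assumption; `parse_verdict` is a hypothetical helper, and it accepts replies that wrap the JSON object in extra prose or formatting.

```python
import json
import re

def parse_verdict(raw: str) -> dict:
    """Extract {"achieved": bool, "reasoning": str} from a judge reply."""
    # Take everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    if not isinstance(verdict.get("achieved"), bool):
        raise ValueError("'achieved' must be a JSON boolean")
    return {
        "achieved": verdict["achieved"],
        "reasoning": str(verdict.get("reasoning", "")),
    }
```

Raising on malformed output, rather than defaulting to "not achieved", keeps judge failures visible instead of counting them as agent failures.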

Persona library

Maintain personas.json:

[
  {"id": "p001", "persona": "elderly first-time user", "goal": "..."},
  {"id": "p002", "persona": "frustrated expert", "goal": "..."}
]

Iterate on it as bugs surface: each bug becomes a new persona, and each new persona extends eval coverage.
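
As the library grows, a loader that rejects incomplete or duplicate entries helps keep it usable. This is a minimal sketch assuming the three fields shown above (`id`, `persona`, `goal`); `load_personas` and `REQUIRED_FIELDS` are hypothetical names, not part of ADK.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("id", "persona", "goal")

def load_personas(path):
    """Load personas.json and fail fast on missing fields or repeated ids."""
    personas = json.loads(Path(path).read_text(encoding="utf-8"))
    seen = set()
    for entry in personas:
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            raise ValueError(f"persona {entry.get('id', '?')} is missing {missing}")
        if entry["id"] in seen:
            raise ValueError(f"duplicate persona id: {entry['id']}")
        seen.add(entry["id"])
    return personas
```

Because each entry carries an `id` that `make_user_simulator` does not accept, pass the fields explicitly, for example `make_user_simulator(persona=p["persona"], goal=p["goal"])`, rather than unpacking the whole entry.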

Validation

User simulator stays in character (test by reading transcripts)
END token reliably emitted when goal met
Max-turn cap fires for stuck conversations
Goal-completion judge agrees with human review on a sample
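
The last check can be made concrete by labeling a sample of transcripts by hand and comparing those labels with the judge's. The sketch below assumes both sets of labels are booleans in the same order; `agreement_rate` and `disagreements` are hypothetical helpers, and plain agreement is only one possible measure.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of sampled transcripts where judge and human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled transcript")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def disagreements(ids, judge_labels, human_labels):
    """Ids of the transcripts to re-read because the two labels differ."""
    return [i for i, j, h in zip(ids, judge_labels, human_labels) if j != h]
```

Reading the transcripts returned by `disagreements` tends to show whether the judge prompt or the human rubric needs tightening.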

See also

environment-simulation for stateful environments (mock APIs, mock DBs)
custom-metric-builder to score along your rubric