environment-simulation | adk-evaluation

Stats

Actions

Tags

environment-simulation | adk-evaluation

environment-simulation

Stub out the world around your agent so eval runs are deterministic, fast, and free.

When to use

Tools hit paid APIs you don't want to charge during CI
Tools mutate real systems (sending emails, charging cards)
Need deterministic responses for snapshot testing
Reproducing a specific bug requires a controlled clock / DB state

Mock tool fixtures

# tests/fixtures/mock_tools.py
from datetime import datetime

class FakeWeatherAPI:
    def __init__(self):
        self.calls = []
        self.responses = {
            "Tokyo": {"temperature": 68, "conditions": "clear"},
            "Paris": {"temperature": 55, "conditions": "rain"},
        }

    def get_weather(self, city: str, units: str = "fahrenheit") -> dict:
        self.calls.append((city, units))
        if city not in self.responses:
            return {"error": f"Unknown city: {city}"}
        return {**self.responses[city], "city": city, "units": units}

Wire into an eval harness

import pytest
from google.adk.agents import LlmAgent

@pytest.fixture
def fake_weather():
    return FakeWeatherAPI()

@pytest.fixture
def agent_under_test(fake_weather):
    return LlmAgent(
        name="weather_assistant",
        model="gemini-2.5-flash",
        instruction="Answer weather questions.",
        tools=[fake_weather.get_weather],
    )

async def test_known_city(agent_under_test, fake_weather):
    result = await agent_under_test.run("Weather in Tokyo?")
    assert "68" in result or "clear" in result
    assert ("Tokyo", "fahrenheit") in fake_weather.calls

Frozen clock

from freezegun import freeze_time

@freeze_time("2026-01-15 09:00:00")
async def test_morning_briefing(agent_under_test):
    result = await agent_under_test.run("What's today?")
    assert "January 15, 2026" in result or "2026-01-15" in result

Mock LLM (for fully deterministic tests)

from google.adk.testing import StubLlm

stub = StubLlm(responses=[
    {"text": "I'll check the weather.", "tool_calls": [{"name": "get_weather", "args": {"city": "Tokyo"}}]},
    {"text": "It's 68°F and clear in Tokyo."},
])

agent = LlmAgent(
    name="test",
    model=stub,
    tools=[fake_weather.get_weather],
)

Snapshot testing

import json
from pathlib import Path

async def test_snapshot(agent_under_test, snapshot):
    transcript = await run_eval_case(agent_under_test, "weather_tokyo")
    expected = json.loads(Path("snapshots/weather_tokyo.json").read_text())
    assert transcript == expected

Validation

Tests run with no network (pytest --forbid-network or similar)
Deterministic across runs (same outputs for same inputs)
Mock state resets between tests (use pytest fixtures, not module globals)
Snapshot diffs are reviewable (small, text-based)

See also

user-simulation-runner to drive the agent inside the simulated environment
custom-metric-builder to score behavior in the simulation