agent-optimization-loop | adk-evaluation

agent-optimization-loop

Iteratively improve an ADK 2.0 agent against an eval set. Mutate → evaluate → keep best → repeat.

When to use

Eval scores are below target and you've exhausted manual tuning
You want systematic improvement instead of vibe-driven prompt edits
You have a stable eval set and a metric to optimize against
You have multiple plausible instruction variants and need to choose between them

Optimization template

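import asyncio
from google.adk.agents import LlmAgent
from google.adk.evaluation import EvalRunner
from google.adk.models.lite_llm import LiteLlm

# Assumed to be defined elsewhere in your project: get_weather (tool),
# default_metric, format_cases, parse_variants, get_failing_cases.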

BASE_INSTRUCTION = "You are a weather assistant. Answer concisely."

async def evaluate(instruction: str, eval_set_path: str) -> float:
    agent = LlmAgent(
        name="candidate",
        model="gemini-2.5-flash",
        instruction=instruction,
        tools=[get_weather],
    )
    runner = EvalRunner(agent=agent, metrics=[default_metric])
    report = await runner.run(eval_set_path)
    return report.aggregate_score

async def propose_variants(current: str, weak_cases: list) -> list[str]:
    """Use an LLM to propose 3 instruction tweaks given failure modes."""
    judge = LiteLlm(model="gemini-2.5-pro")
    prompt = (
        f"Current instruction:\n{current}\n\n"
        f"These cases failed:\n{format_cases(weak_cases)}\n\n"
        "Propose 3 alternative instructions that might fix the failures. "
        "Output JSON: {variants: [str, str, str]}."
    )
    out = await judge.complete(prompt)
    return parse_variants(out)

async def optimize(eval_set: str, max_iters: int = 5):
    current = BASE_INSTRUCTION
    best_score = await evaluate(current, eval_set)
    history = [(current, best_score)]

    for i in range(max_iters):
        weak = await get_failing_cases(current, eval_set)
        if not weak:
            break
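        variants = await propose_variants(current, weak)
        if not variants:
            # Nothing to score; max() below would fail on an empty list.
            break
        # Score all variants concurrently.
        scores = await asyncio.gather(*(evaluate(v, eval_set) for v in variants))
        # Use `k` so the lambda parameter does not shadow the loop index `i`.
        best_idx = max(range(len(scores)), key=lambda k: scores[k])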
        if scores[best_idx] > best_score:
            current = variants[best_idx]
            best_score = scores[best_idx]
            history.append((current, best_score))
            print(f"Iter {i}: new best {best_score:.3f}")
        else:
            print(f"Iter {i}: no improvement; stopping.")
            break

    return current, best_score, history

best_inst, best_score, history = asyncio.run(optimize("./eval.evalset.json"))
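
The template calls `format_cases` and `parse_variants` without defining them. A minimal sketch of both follows, under stated assumptions: each failing case is a dict whose `query`, `expected`, and `actual` keys are hypothetical names chosen for illustration, and the model replies with a JSON object containing a `variants` list, possibly wrapped in extra prose.

```python
import json
import re


def format_cases(weak_cases: list[dict]) -> str:
    """Render failing cases as a numbered plain-text list for the prompt."""
    lines = []
    for n, case in enumerate(weak_cases, start=1):
        # "query", "expected" and "actual" are hypothetical field names.
        lines.append(
            f"{n}. query={case.get('query')!r} "
            f"expected={case.get('expected')!r} actual={case.get('actual')!r}"
        )
    return "\n".join(lines)


def parse_variants(raw: str, limit: int = 3) -> list[str]:
    """Extract the "variants" list from model output; return [] on any failure."""
    # Grab the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    variants = payload.get("variants")
    if not isinstance(variants, list):
        return []
    # Keep only non-empty strings and cap the count.
    cleaned = [v.strip() for v in variants if isinstance(v, str) and v.strip()]
    return cleaned[:limit]
```

Returning an empty list instead of raising lets the caller decide whether to retry the proposal step or stop the loop.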

What to optimize

Knob	When to vary
Instruction text	Most impactful; vary first
Model	If quality plateaus on flash, try pro
Tool order	Sometimes affects which tools the LLM picks
Temperature	For creative vs deterministic tasks
Few-shot examples	When edge cases keep failing
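
One way to respect the "vary first" ordering in the table is a staged search: tune one knob at a time in priority order, keeping the best value found before moving to the next. The sketch below is illustrative only. `Candidate`, `staged_search`, and the synchronous `score_fn` are hypothetical names; in practice `score_fn` would wrap the async `evaluate` call.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    instruction: str
    model: str
    temperature: float


def staged_search(
    base: Candidate,
    knobs: dict[str, list],
    score_fn: Callable[[Candidate], float],
) -> tuple[Candidate, float]:
    """Vary one knob at a time, in the order the dict lists them."""
    best, best_score = base, score_fn(base)
    # Dicts preserve insertion order, so put the most impactful knob first.
    for field, values in knobs.items():
        for value in values:
            trial = replace(best, **{field: value})
            trial_score = score_fn(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score
```

This costs one evaluation per knob value instead of one per combination, at the price of missing interactions between knobs.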

Train/test split

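Don't optimize on your test set. A score measured on the same cases you tuned against overstates how well the instruction generalizes: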

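async def main():
    # Split the eval set 60/40 into train and held-out test portions.
    train_set, test_set = split_eval_set("./full.evalset.json", train_fraction=0.6)
    # Optimize on train only, then report the final score on test.
    best_instruction, train_score, history = await optimize(train_set)
    final_score = await evaluate(best_instruction, test_set)
    print(f"train {train_score:.3f} / test {final_score:.3f}")

asyncio.run(main())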
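
`split_eval_set` is not defined in this template. The core of such a helper might look like the sketch below, which splits an in-memory list of cases with a seeded shuffle so the split is reproducible across runs. `split_cases` is a hypothetical name, and reading or writing the eval set files is deliberately left out because the file schema is not specified here.

```python
import random


def split_cases(cases: list, train_fraction: float = 0.6, seed: int = 0):
    """Return (train, test) lists from a deterministic shuffled split."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(cases)  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    if len(shuffled) >= 2:
        # Keep at least one case on each side of the split.
        cut = max(1, min(cut, len(shuffled) - 1))
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed matters: if the split changes between runs, cases leak from test into train over time and the held-out score stops being held out.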

Cost guardrails

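MAX_COST_USD = 5.00
total_cost = 0.0

class BudgetExceeded(RuntimeError):
    """Raised when cumulative eval spend reaches MAX_COST_USD."""

async def evaluate_with_budget(instruction, eval_set):
    # `global`, not `nonlocal`: total_cost is module-level state.
    global total_cost
    if total_cost >= MAX_COST_USD:
        raise BudgetExceeded(f"spent ${total_cost:.2f} of ${MAX_COST_USD:.2f}")
    # _evaluate is assumed to return (score, cost_in_usd).
    score, cost = await _evaluate(instruction, eval_set)
    total_cost += cost
    return score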
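
The guard above keeps the running total in module-level state. An alternative sketch holds it on an object, which makes the cap easier to test and to reset between runs. `BudgetedEvaluator` is a hypothetical name, and the wrapped function is assumed to return a `(score, cost_in_usd)` pair as `_evaluate` does.

```python
import asyncio


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend has reached the cap."""


class BudgetedEvaluator:
    def __init__(self, evaluate_fn, max_cost_usd: float):
        self._evaluate_fn = evaluate_fn  # async fn returning (score, cost)
        self.max_cost_usd = max_cost_usd
        self.total_cost = 0.0

    async def __call__(self, instruction, eval_set):
        # Refuse to start a new evaluation once the cap is reached.
        if self.total_cost >= self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.2f} of ${self.max_cost_usd:.2f}"
            )
        score, cost = await self._evaluate_fn(instruction, eval_set)
        self.total_cost += cost
        return score
```

Because the check happens before the cost is known, spend can overshoot the cap by the cost of evaluations already in flight; when variants are scored concurrently with `asyncio.gather`, set the cap with that margin in mind.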

Validation

Final score on held-out test set ≥ baseline (if not, revert)
History log committed to repo for traceability
Variant generator doesn't drift from intended scope (review proposed variants)
Cost cap enforced; runaway loops blocked
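
The first two checks can be automated. The sketch below is one possible shape, with `finalize` and the log layout as hypothetical choices: it keeps the optimized instruction only if its held-out score is at least the baseline's, and writes the history to a JSON file that can be committed.

```python
import json
from pathlib import Path


def finalize(
    base_instruction: str,
    best_instruction: str,
    baseline_test_score: float,
    final_test_score: float,
    history: list,
    log_path: str,
) -> str:
    """Pick the instruction to ship and write a traceable history log."""
    accepted = final_test_score >= baseline_test_score
    chosen = best_instruction if accepted else base_instruction  # revert on regression
    record = {
        "accepted": accepted,
        "baseline_test_score": baseline_test_score,
        "final_test_score": final_test_score,
        "chosen_instruction": chosen,
        # history is a list of (instruction, score) pairs from the optimize loop
        "history": [{"instruction": inst, "score": s} for inst, s in history],
    }
    Path(log_path).write_text(json.dumps(record, indent=2), encoding="utf-8")
    return chosen
```

The remaining two checks, reviewing proposed variants for scope drift and confirming the cost cap held, still need a human look at the log.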

See also

eval-set-generator for the test cases this optimizes against
custom-metric-builder for the metric this maximizes