From adk-evaluation
Use this skill to run an automated optimization loop on an ADK 2.0 agent: iteratively tune instructions, models, or tool selection against an eval set until scores plateau. Triggers on: "ADK agent optimization", "auto-tune ADK agent", "improve ADK agent score", "ADK prompt optimization", "ADK iterative optimizer", "ADK eval-driven improvement", "optimize agent automatically". Generates an optimizer loop that mutates the agent config, runs the eval, and picks the winning variant.
Slash command
/adk-evaluation:agent-optimization-loop
Iteratively improve an ADK 2.0 agent against an eval set. Mutate → evaluate → keep best → repeat.
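import asyncio

from google.adk.agents import LlmAgent
from google.adk.evaluation import EvalRunner
from google.adk.models.lite_llm import LiteLlm  # used by propose_variants; adjust the path if your ADK version differs

# get_weather (the agent's tool) and default_metric are assumed to be defined elsewhere.

BASE_INSTRUCTION = "You are a weather assistant. Answer concisely."


async def evaluate(instruction: str, eval_set_path: str) -> float:
    """Build a candidate agent with this instruction and return its aggregate eval score."""
    agent = LlmAgent(
        name="candidate",
        model="gemini-2.5-flash",
        instruction=instruction,
        tools=[get_weather],
    )
    runner = EvalRunner(agent=agent, metrics=[default_metric])
    report = await runner.run(eval_set_path)
    return report.aggregate_score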


async def propose_variants(current: str, weak_cases: list) -> list[str]:
    """Use an LLM to propose 3 instruction tweaks given failure modes."""
    # format_cases and parse_variants are helpers assumed to be defined elsewhere.
    judge = LiteLlm(model="gemini-2.5-pro")
    prompt = (
        f"Current instruction:\n{current}\n\n"
        f"These cases failed:\n{format_cases(weak_cases)}\n\n"
        "Propose 3 alternative instructions that might fix the failures. "
        'Output JSON: {"variants": [str, str, str]}.'
    )
    out = await judge.complete(prompt)
    return parse_variants(out)


async def optimize(eval_set: str, max_iters: int = 5):
    """Hill-climb on the instruction: keep a variant only if it beats the best score so far."""
    current = BASE_INSTRUCTION
    best_score = await evaluate(current, eval_set)
    history = [(current, best_score)]
    for i in range(max_iters):
        # get_failing_cases is a helper assumed to be defined elsewhere.
        weak = await get_failing_cases(current, eval_set)
        if not weak:
            break  # nothing left to fix
        variants = await propose_variants(current, weak)
        if not variants:
            break  # proposer returned nothing usable
        # Score all variants concurrently.
        scores = await asyncio.gather(*(evaluate(v, eval_set) for v in variants))
        best_idx = max(range(len(scores)), key=lambda j: scores[j])
        if scores[best_idx] > best_score:
            current = variants[best_idx]
            best_score = scores[best_idx]
            history.append((current, best_score))
            print(f"Iter {i}: new best {best_score:.3f}")
        else:
            print(f"Iter {i}: no improvement; stopping.")
            break
    return current, best_score, history


best_inst, best_score, history = asyncio.run(optimize("./eval.evalset.json"))
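The loop above relies on `format_cases` and `parse_variants` without defining them. A minimal sketch of both follows, using only the standard library. Both implementations are hypothetical: the failing-case fields (`input`, `expected`, `actual`) are an assumed shape, not an ADK schema, and the parser assumes the LLM reply contains a JSON object with a `variants` list, possibly wrapped in extra prose.

```python
import json
import re


def parse_variants(raw: str, expected: int = 3) -> list[str]:
    """Extract the "variants" list from an LLM reply that should contain JSON."""
    # Take the span from the first "{" to the last "}" so surrounding prose is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    variants = payload.get("variants", [])
    if not isinstance(variants, list):
        return []
    # Keep non-empty strings only, de-duplicated in order, capped at `expected`.
    seen: set[str] = set()
    cleaned: list[str] = []
    for item in variants:
        if not isinstance(item, str):
            continue
        text = item.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned[:expected]


def format_cases(cases: list[dict]) -> str:
    """Render failing cases as a numbered list for the proposer prompt.

    ASSUMPTION: each case is a dict with "input", "expected", and "actual" keys.
    """
    lines = []
    for n, case in enumerate(cases, start=1):
        lines.append(
            f"{n}. input={case.get('input')!r} "
            f"expected={case.get('expected')!r} got={case.get('actual')!r}"
        )
    return "\n".join(lines)
```

Returning an empty list on malformed output, rather than raising, lets the optimizer treat a bad proposal round as "no candidates" and stop cleanly.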
| Knob | When to vary |
|---|---|
| Instruction text | Most impactful; vary first |
| Model | If quality plateaus on flash, try pro |
| Tool order | Sometimes affects which tools the LLM picks |
| Temperature | For creative vs deterministic tasks |
| Few-shot examples | When edge cases keep failing |
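One way to keep comparisons interpretable is to vary a single knob per round, so any score change can be attributed to that knob. The sketch below illustrates this with a hypothetical `AgentConfig` dataclass and `one_knob_variants` helper. Neither is part of ADK, and the default values are placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical bundle of the knobs listed in the table (not an ADK type)."""
    instruction: str
    model: str = "gemini-2.5-flash"
    temperature: float = 0.2
    tool_order: tuple[str, ...] = ()
    few_shot: tuple[str, ...] = ()


def one_knob_variants(base: AgentConfig, knob: str, values: list) -> list[AgentConfig]:
    """Return copies of `base` that differ only in `knob`, skipping the current value."""
    current = getattr(base, knob)
    return [replace(base, **{knob: v}) for v in values if v != current]


base = AgentConfig(instruction="You are a weather assistant. Answer concisely.")
# Each list holds configs that differ from `base` in exactly one field.
model_variants = one_knob_variants(base, "model", ["gemini-2.5-flash", "gemini-2.5-pro"])
temp_variants = one_knob_variants(base, "temperature", [0.0, 0.2, 0.7])
```

Each variant can then be turned into an agent and scored the same way instruction variants are scored in the optimizer loop.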
Don't optimize on your test set:
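# Split eval set 60/40 (split_eval_set is a helper assumed to be defined elsewhere)
train_set, test_set = split_eval_set("./full.evalset.json", train_fraction=0.6)


async def main():
    # Optimize on train only; optimize() returns (instruction, score, history)
    best_instruction, train_score, _ = await optimize(train_set)
    # Report the final score on the held-out test split
    final_score = await evaluate(best_instruction, test_set)
    print(f"train {train_score:.3f} / test {final_score:.3f}")


asyncio.run(main())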
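`split_eval_set` is not defined in this skill. A possible sketch is shown here, under the assumption that the eval set file is JSON with its cases in a top-level `eval_cases` list; check your own file and adjust the key if it differs. It writes two sibling files and returns their paths, matching how `optimize` and `evaluate` take a path. The `seed` parameter is an addition for reproducibility.

```python
import json
import random
from pathlib import Path


def split_eval_set(path: str, train_fraction: float = 0.6, seed: int = 0) -> tuple[str, str]:
    """Split an eval set file into train/test files and return their paths."""
    src = Path(path)
    data = json.loads(src.read_text())
    # ASSUMPTION: cases live in a top-level "eval_cases" list.
    cases = list(data["eval_cases"])
    random.Random(seed).shuffle(cases)  # seeded so the split is reproducible
    cut = round(len(cases) * train_fraction)
    out_paths = []
    for suffix, subset in (("train", cases[:cut]), ("test", cases[cut:])):
        # e.g. full.evalset.json -> full.evalset.train.json
        out = src.with_name(f"{src.stem}.{suffix}{src.suffix}")
        # Copy all other top-level fields unchanged; replace only the case list.
        out.write_text(json.dumps({**data, "eval_cases": subset}, indent=2))
        out_paths.append(str(out))
    return out_paths[0], out_paths[1]
```

Fixing the seed matters here: if the split changes between runs, cases can leak from test into train across optimization sessions.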
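MAX_COST_USD = 5.00
total_cost = 0.0


class BudgetExceeded(Exception):
    """Raised once cumulative eval spend reaches MAX_COST_USD."""


async def evaluate_with_budget(instruction, eval_set):
    # total_cost lives at module level, so declare it global before updating it.
    global total_cost
    if total_cost >= MAX_COST_USD:
        raise BudgetExceeded()
    # _evaluate is assumed to return (score, cost_in_usd) for one eval run.
    score, cost = await _evaluate(instruction, eval_set)
    total_cost += cost
    return score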
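If module-level state is awkward, for example when running several optimizations in one process, the same guard can be held in an object. This is a sketch only: `BudgetedEvaluator` and `fake_evaluate` are hypothetical names, and the wrapped evaluator is assumed to return a `(score, cost_usd)` pair, as `_evaluate` does above.

```python
import asyncio


class BudgetExceeded(RuntimeError):
    """Raised when the evaluation budget has been spent."""


class BudgetedEvaluator:
    """Wraps an async evaluator returning (score, cost_usd) and tracks total spend."""

    def __init__(self, evaluate_fn, max_cost_usd: float):
        self._evaluate_fn = evaluate_fn
        self.max_cost_usd = max_cost_usd
        self.total_cost = 0.0

    async def __call__(self, instruction: str, eval_set: str) -> float:
        # The check runs before each eval, so spend can overshoot by the runs
        # already in flight (one run when called sequentially).
        if self.total_cost >= self.max_cost_usd:
            raise BudgetExceeded(f"spent ${self.total_cost:.2f} of ${self.max_cost_usd:.2f}")
        score, cost = await self._evaluate_fn(instruction, eval_set)
        self.total_cost += cost
        return score


async def fake_evaluate(instruction: str, eval_set: str) -> tuple[float, float]:
    """Stand-in evaluator for the demo: fixed score, fixed cost of $2 per run."""
    return 0.5, 2.0


async def demo() -> tuple[int, float]:
    guarded = BudgetedEvaluator(fake_evaluate, max_cost_usd=5.0)
    runs = 0
    try:
        while True:
            await guarded("some instruction", "./eval.evalset.json")
            runs += 1
    except BudgetExceeded:
        pass
    return runs, guarded.total_cost


runs, spent = asyncio.run(demo())
```

With a $2 stub and a $5 cap, the third run starts at $4 spent and finishes at $6, so a hard ceiling would need the check to account for the expected cost of the next run.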
See also:
eval-set-generator for the test cases this optimizes against
custom-metric-builder for the metric this maximizes
npx claudepluginhub healthcare-ai-consulting-llc/adk-2-toolkit --plugin adk-evaluation