From adk-evaluation
Use this skill to define custom evaluation metrics for ADK 2.0 agents beyond the built-in tool-match and response-similarity scores. Triggers on: "ADK custom metric", "ADK eval metric", "score ADK agent on X", "custom evaluator ADK", "ADK rubric eval", "LLM-as-judge ADK", "eval metric Python ADK". Generates a Python evaluator class (or function) that scores eval cases on domain-specific criteria.
Slash command: /adk-evaluation:custom-metric-builder
Custom metric for `adk eval` — score on whatever matters for your domain (medical accuracy, code correctness, brand voice, citation quality).
import re

from google.adk.evaluation import Metric, EvalCaseResult


class CitationCountMetric(Metric):
    """Count [source: URL] citations in the agent's final response."""

    name = "citation_count"

    def evaluate(self, case_result: EvalCaseResult) -> dict:
        text = case_result.final_response_text
        citations = re.findall(r"\[source:\s*https?://", text)
        return {
            "score": min(len(citations) / 3.0, 1.0),  # normalize: 3+ = full marks
            "details": {"count": len(citations)},
        }
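The scoring logic in the citation metric can be tried without ADK installed. The following is a minimal standalone sketch of the same regex and normalization; the function name `citation_score` and the `target` parameter are illustrative assumptions, not part of the ADK API.

```python
import re

# Same pattern as the metric above: matches "[source: http(s)://" markers.
CITATION_RE = re.compile(r"\[source:\s*https?://")


def citation_score(text: str, target: int = 3) -> dict:
    """Build the {"score", "details"} dict for a response string.

    `target` is a hypothetical knob: the citation count that earns full marks.
    """
    count = len(CITATION_RE.findall(text))
    return {"score": min(count / target, 1.0), "details": {"count": count}}


sample = (
    "Claim one [source: https://example.org/a]. "
    "Claim two [source: http://example.org/b]."
)
result = citation_score(sample)
print(result["details"]["count"], round(result["score"], 3))
```

Two citations against a target of three give a partial score, and anything at or above the target is capped at 1.0.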
import json

from google.adk.evaluation import Metric
from google.adk.models.lite_llm import LiteLlm

JUDGE = LiteLlm(model="gemini-2.5-pro")


class MedicalAccuracyMetric(Metric):
    name = "medical_accuracy"

    async def evaluate(self, case_result):
        rubric = (
            "Score the response 0-10 on medical accuracy. "
            "Penalize: incorrect dosing, contraindications missed, "
            "non-evidence-based claims. Output JSON: {score: int, rationale: str}."
        )
        out = await JUDGE.complete(
            f"{rubric}\n\nResponse:\n{case_result.final_response_text}"
        )
        # Assumes the judge replies with bare JSON; json.loads raises otherwise.
        parsed = json.loads(out)
        return {"score": parsed["score"] / 10.0, "details": {"rationale": parsed["rationale"]}}
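Judge models often wrap their JSON in code fences or add a sentence of preamble, which would make a bare `json.loads` call fail. The sketch below shows one possible defensive parser using only the standard library. The helper name `parse_judge_output`, the `max_score` parameter, and the choice to return a score of 0.0 on failure are all assumptions for illustration, not ADK behavior.

```python
import json
import re


def parse_judge_output(raw: str, max_score: float = 10.0) -> dict:
    """Extract a judge's JSON reply and normalize its score to [0, 1]."""
    # Take the outermost {...} span so code fences or preamble text are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {"score": 0.0, "details": {"error": "no JSON object found"}}
    try:
        parsed = json.loads(match.group(0))
        score = float(parsed["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return {"score": 0.0, "details": {"error": f"unparseable judge output: {exc}"}}
    # Clamp so an out-of-range judge score cannot leave the [0, 1] interval.
    score = min(max(score / max_score, 0.0), 1.0)
    return {"score": score, "details": {"rationale": parsed.get("rationale", "")}}


fenced = '```json\n{"score": 7, "rationale": "dosing is correct"}\n```'
print(parse_judge_output(fenced))
```

Scoring a parse failure as 0.0 keeps an eval run going but hides judge errors in the average. Raising an exception instead is a reasonable alternative when failures should be visible.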
from google.adk.evaluation import EvalRunner

runner = EvalRunner(
    agent=root_agent,
    metrics=[CitationCountMetric(), MedicalAccuracyMetric()],
)
# `await` is only valid inside an async function (or an async-aware REPL).
report = await runner.run("./eval_set.evalset.json")
CLI flag form:
adk eval ./agent.py ./eval_set.evalset.json \
  --metrics my_module.CitationCountMetric,my_module.MedicalAccuracyMetric
class TripleRubric(Metric):
    name = "triple_rubric"

    async def evaluate(self, case_result):
        # _judge is a helper not shown here; it is assumed to return a float in [0, 1].
        scores = {
            "accuracy": await self._judge(case_result, "factual accuracy"),
            "tone": await self._judge(case_result, "professional tone"),
            "brevity": await self._judge(case_result, "conciseness"),
        }
        return {"score": sum(scores.values()) / 3, "details": scores}
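The three judge calls in the rubric above run one after another. Because they are independent, they could also run concurrently. The sketch below shows that variation with `asyncio.gather`; the `judge` coroutine is a hypothetical stand-in with canned scores, since the source does not show how `_judge` is implemented.

```python
import asyncio


async def judge(text: str, criterion: str) -> float:
    """Hypothetical stand-in for an LLM judge call; returns a score in [0, 1]."""
    await asyncio.sleep(0)  # placeholder for the real network round trip
    canned = {"factual accuracy": 0.9, "professional tone": 0.6, "conciseness": 0.3}
    return canned[criterion]


async def triple_rubric(text: str) -> dict:
    criteria = {
        "accuracy": "factual accuracy",
        "tone": "professional tone",
        "brevity": "conciseness",
    }
    # gather() runs the judge calls concurrently and returns results in input order.
    values = await asyncio.gather(*(judge(text, c) for c in criteria.values()))
    scores = dict(zip(criteria.keys(), values))
    return {"score": sum(scores.values()) / len(scores), "details": scores}


result = asyncio.run(triple_rubric("example response"))
print(result)
```

Dividing by `len(scores)` rather than a literal 3 keeps the average correct if a criterion is added or removed later.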
Every metric returns {"score": float in [0,1], "details": dict}.
See also: eval-set-generator for the test cases this scores; agent-optimization-loop to use these metrics as the optimization signal.
Install: npx claudepluginhub healthcare-ai-consulting-llc/adk-2-toolkit --plugin adk-evaluation
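The return shape `{"score": float in [0,1], "details": dict}` can be checked before a result reaches a report. The following is a minimal sketch of such a check; `validate_metric_result` is a hypothetical helper, not something the source says ADK provides.

```python
import math


def validate_metric_result(result: object) -> dict:
    """Raise ValueError unless result looks like {"score": float in [0,1], "details": dict}."""
    if not isinstance(result, dict):
        raise ValueError("metric result must be a dict")
    score = result.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    if math.isnan(score) or not 0.0 <= score <= 1.0:
        raise ValueError(f"'score' must be within [0, 1], got {score}")
    if not isinstance(result.get("details"), dict):
        raise ValueError("'details' must be a dict")
    return result


checked = validate_metric_result({"score": 0.5, "details": {"count": 2}})
print(checked)
```

Calling a check like this at the end of each `evaluate` method would surface an unnormalized judge score (for example 7 instead of 0.7) at the point where it is produced.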