From agno-agentos-api
Interact with AgentOS Evals API endpoints. For standard operations (listing evals, running accuracy/performance evals, getting eval details), use the provided CLI script first. Only write custom Python when the script cannot handle the use case (e.g., advanced filtering, update/delete operations, chaining evals, agent-as-judge). Trigger when: running evaluations, listing eval runs, benchmarking agents, or asking things like "run an accuracy eval on my agent" or "show me the latest eval results."
Slash command: /agno-agentos-api:agentos-api-evals
Start an AgentOS server with agents:
export ANTHROPIC_API_KEY=sk-...
uv run start_agentos.py # See agno-agentos skill
Always try the provided script first. It covers listing evals, running accuracy and performance evaluations, filtering by agent, and getting eval details — all from the command line with no custom code needed.
The script is at: scripts/run_evals.py
# List eval runs
uv run scripts/run_evals.py
# Run an accuracy eval
uv run scripts/run_evals.py \
  --agent-id my-agent \
  --accuracy --input "What is 2+2?" --expected "4"
# Run a performance eval
uv run scripts/run_evals.py \
  --agent-id my-agent \
  --performance --input "Hello" --iterations 3
# Get details for one eval run
uv run scripts/run_evals.py --eval-id abc-123
# Filter eval runs by agent
uv run scripts/run_evals.py --agent-id my-agent
# Point the script at a different server
uv run scripts/run_evals.py --base-url http://my-server:8000
# Show all options
uv run scripts/run_evals.py --help
Only write ad-hoc Python when the CLI script cannot handle your use case. The Evals API exposes the following endpoints:
| Method | Path | Description |
|---|---|---|
| GET | /eval-runs | List evaluation runs (paginated, filterable) |
| GET | /eval-runs/{eval_id} | Get evaluation run details |
| POST | /eval-runs | Execute evaluation |
| PATCH | /eval-runs/{eval_id} | Update evaluation run |
| DELETE | /eval-runs | Delete evaluation runs |
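
Update and delete are among the operations the CLI script does not cover. The sketch below shows one way to build (without sending) raw requests for the PATCH and DELETE endpoints in the table above, using only the standard library. The payload shapes are assumptions: the `name` field and the `eval_run_ids` key are hypothetical and should be checked against the server's API docs before sending. The `eval_request` helper is likewise illustrative, not part of the skill.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:7777"  # same local server address used elsewhere in this document


def eval_request(method, path, payload=None):
    """Build (but do not send) a JSON request for an Evals API endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return Request(f"{BASE_URL}{path}", data=data, headers=headers, method=method)


# ASSUMPTION: "name" and "eval_run_ids" are hypothetical payload fields,
# not confirmed by this document.
update = eval_request("PATCH", "/eval-runs/abc-123", {"name": "renamed-eval"})
delete = eval_request("DELETE", "/eval-runs", {"eval_run_ids": ["abc-123"]})

for req in (update, delete):
    print(req.get_method(), req.full_url, req.data)
```

Once the payload shape is confirmed, a request built this way can be sent with `urllib.request.urlopen(req)`.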
Supported evaluation types:

| Type | Description |
|---|---|
| accuracy | Compare agent output against expected output |
| performance | Measure response time and throughput |
| agent_as_judge | Use another agent to evaluate quality |
| reliability | Test consistency across multiple runs |
The list endpoint supports filtering beyond what the CLI provides:
import asyncio
from agno.client import AgentOSClient

async def main():
    client = AgentOSClient(base_url="http://localhost:7777")
    # Fetch eval runs and print a short summary of each
    evals = await client.list_eval_runs()
    print(f"Found {len(evals.data)} evaluation runs")
    for eval_run in evals.data:
        print(f" ID: {eval_run.id}")
        print(f" Name: {eval_run.name}")
        print(f" Type: {eval_run.eval_type}")
        print(f" Agent: {eval_run.agent_id}")

asyncio.run(main())
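
For cases where the raw endpoint is called directly, the list filters named in this section (agent_id, eval_types, limit, and the rest) can be assembled into a query string. This is a minimal sketch, assuming the filters are sent as URL query parameters on GET /eval-runs; the `build_eval_runs_url` helper is hypothetical and not part of the client or the CLI script.

```python
from urllib.parse import urlencode

# Filter names copied from this document's list of supported filter parameters.
ALLOWED_FILTERS = {
    "agent_id", "team_id", "workflow_id", "model_id", "type",
    "eval_types", "limit", "page", "sort_by", "sort_order",
}


def build_eval_runs_url(base_url, **filters):
    """Build a GET /eval-runs URL, skipping unset filters and rejecting unknown ones."""
    unknown = set(filters) - ALLOWED_FILTERS
    if unknown:
        raise ValueError(f"Unknown filter(s): {sorted(unknown)}")
    params = {}
    for key, value in filters.items():
        if value is None:
            continue
        # eval_types is documented as a comma-separated string
        if key == "eval_types" and not isinstance(value, str):
            value = ",".join(value)
        params[key] = value
    url = f"{base_url.rstrip('/')}/eval-runs"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


url = build_eval_runs_url(
    "http://localhost:7777",
    agent_id="my-agent",
    eval_types=["accuracy", "performance"],
    limit=5,
)
print(url)
```

Rejecting unknown filter names up front catches typos such as `agentid` before a request is made.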
Supported filter parameters:
- agent_id: Filter by agent
- team_id: Filter by team
- workflow_id: Filter by workflow
- model_id: Filter by model
- type: Filter by component type (agent, team, workflow)
- eval_types: Comma-separated eval types (accuracy, performance, agent_as_judge, reliability)
- limit, page: Pagination (default: 20 per page)
- sort_by, sort_order: Sorting (default: created_at desc)

Run multiple eval types and aggregate results in a single script:
import asyncio
from agno.client import AgentOSClient
from agno.db.schemas.evals import EvalType

async def main():
    client = AgentOSClient(base_url="http://localhost:7777")
    # Use the first agent registered on the server
    config = await client.aget_config()
    agent_id = config.agents[0].id

    # Run accuracy eval
    accuracy = await client.run_eval(
        agent_id=agent_id,
        eval_type=EvalType.ACCURACY,
        input_text="What is the capital of France?",
        expected_output="Paris",
    )
    print(f"Accuracy eval: {accuracy.id if accuracy else 'failed'}")

    # Run performance eval
    perf = await client.run_eval(
        agent_id=agent_id,
        eval_type=EvalType.PERFORMANCE,
        input_text="Hello",
        num_iterations=2,
    )
    print(f"Performance eval: {perf.id if perf else 'failed'}")

    # List all evaluations
    evals = await client.list_eval_runs()
    print(f"\nTotal evaluations: {len(evals.data)}")
    for e in evals.data:
        print(f" {e.id}: {e.eval_type} - {e.name}")

asyncio.run(main())
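
The script above prints each run individually. One way to aggregate the listing is to count runs per eval type. This is a sketch under assumptions: `summarize_by_type` is a hypothetical helper, and the sample objects are stand-ins that only mimic the `id` and `eval_type` attributes read in the script, not real client responses.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_by_type(eval_runs):
    """Count eval runs per eval_type and return the counts as a plain dict."""
    # str() keeps this working whether eval_type is a plain string or an enum;
    # for a non-string enum the keys would look like "EvalType.ACCURACY".
    return dict(Counter(str(run.eval_type) for run in eval_runs))


# Stand-in data (hypothetical), shaped like the items iterated over in evals.data
sample_runs = [
    SimpleNamespace(id="run-1", eval_type="accuracy"),
    SimpleNamespace(id="run-2", eval_type="performance"),
    SimpleNamespace(id="run-3", eval_type="accuracy"),
]

summary = summarize_by_type(sample_runs)
print(summary)
```

Inside `main()`, the same helper could be called as `summarize_by_type(evals.data)` after the listing step.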
- await: all client methods are async, so every call must be awaited
- Agent IDs: get them from aget_config() or the CLI script
- num_iterations: high values can make performance evals slow and expensive
- eval_data: contains the actual evaluation results and metrics

For advanced eval API patterns, read references/api-patterns.md.
npx claudepluginhub ajshedivy/agno-cookbook --plugin agno-agentos-api