Skill

agentos-api-evals

Interact with AgentOS Evals API endpoints. For standard operations (listing evals, running accuracy/performance evals, getting eval details), use the provided CLI script first. Only write custom Python when the script cannot handle the use case (e.g., advanced filtering, update/delete operations, chaining evals, agent-as-judge). Trigger when: running evaluations, listing eval runs, benchmarking agents, or asking things like "run an accuracy eval on my agent" or "show me the latest eval results."

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agno-agentos-api:agentos-api-evals

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Start an AgentOS server with agents:

Supporting Files

references/api-patterns.mdscripts/run_evals.py

SKILL.md

197 lines · ~1.5k tokens

Stats

LanguagePython

Parent stars3

MaintenanceGood

Last CommitMar 6, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

AgentOS Evals API

Prerequisites

Start an AgentOS server with agents:

export ANTHROPIC_API_KEY=sk-...
uv run start_agentos.py  # See agno-agentos skill

Default: Use the CLI Script

Always try the provided script first. It covers listing evals, running accuracy and performance evaluations, filtering by agent, and getting eval details — all from the command line with no custom code needed.

The script is at: scripts/run_evals.py

List all evaluation runs

uv run scripts/run_evals.py

Run an accuracy evaluation

uv run scripts/run_evals.py \
  --agent-id my-agent \
  --accuracy --input "What is 2+2?" --expected "4"

Run a performance evaluation

uv run scripts/run_evals.py \
  --agent-id my-agent \
  --performance --input "Hello" --iterations 3

Get details for a specific eval

uv run scripts/run_evals.py --eval-id abc-123

Filter evals by agent

uv run scripts/run_evals.py --agent-id my-agent

Use a different server

uv run scripts/run_evals.py --base-url http://my-server:8000

Full CLI reference

uv run scripts/run_evals.py --help

When to Write Custom Python

Only write ad-hoc Python when the CLI script cannot handle your use case:

Advanced filtering (by team, workflow, model, eval type, pagination, sorting)
Update or delete evaluation runs
Agent-as-judge or reliability eval types (not supported by the CLI)
Chaining multiple evals in a single script
Custom error handling or retry logic
Integration tests that assert on eval results programmatically

API Endpoints

Method	Path	Description
GET	`/eval-runs`	List evaluation runs (paginated, filterable)
GET	`/eval-runs/{eval_id}`	Get evaluation run details
POST	`/eval-runs`	Execute evaluation
PATCH	`/eval-runs/{eval_id}`	Update evaluation run
DELETE	`/eval-runs`	Delete evaluation runs

Evaluation Types

Type	Description
`accuracy`	Compare agent output against expected output
`performance`	Measure response time and throughput
`agent_as_judge`	Use another agent to evaluate quality
`reliability`	Test consistency across multiple runs

Custom Python Examples

List with Advanced Filtering

The list endpoint supports filtering beyond what the CLI provides:

async def main():
    client = AgentOSClient(base_url="http://localhost:7777")

    evals = await client.list_eval_runs()
    print(f"Found {len(evals.data)} evaluation runs")

    for eval_run in evals.data:
        print(f"  ID: {eval_run.id}")
        print(f"  Name: {eval_run.name}")
        print(f"  Type: {eval_run.eval_type}")
        print(f"  Agent: {eval_run.agent_id}")

Supported filter parameters:

agent_id: Filter by agent
team_id: Filter by team
workflow_id: Filter by workflow
model_id: Filter by model
type: Filter by component type (agent, team, workflow)
eval_types: Comma-separated eval types (accuracy, performance, agent_as_judge, reliability)
limit, page: Pagination (default: 20 per page)
sort_by, sort_order: Sorting (default: created_at desc)

Chained Evaluation Workflow

Run multiple eval types and aggregate results in a single script:

import asyncio
from agno.client import AgentOSClient
from agno.db.schemas.evals import EvalType

async def main():
    client = AgentOSClient(base_url="http://localhost:7777")
    config = await client.aget_config()
    agent_id = config.agents[0].id

    # Run accuracy eval
    accuracy = await client.run_eval(
        agent_id=agent_id,
        eval_type=EvalType.ACCURACY,
        input_text="What is the capital of France?",
        expected_output="Paris",
    )
    print(f"Accuracy eval: {accuracy.id if accuracy else 'failed'}")

    # Run performance eval
    perf = await client.run_eval(
        agent_id=agent_id,
        eval_type=EvalType.PERFORMANCE,
        input_text="Hello",
        num_iterations=2,
    )
    print(f"Performance eval: {perf.id if perf else 'failed'}")

    # List all evaluations
    evals = await client.list_eval_runs()
    print(f"\nTotal evaluations: {len(evals.data)}")
    for e in evals.data:
        print(f"  {e.id}: {e.eval_type} — {e.name}")

asyncio.run(main())

Anti-Patterns

Don't write custom Python for basic operations — use the CLI script for listing, running accuracy/performance evals, and getting details
Don't forget await — all client methods are async
Don't hardcode agent IDs — discover them from aget_config() or the CLI script
Don't skip error handling — eval endpoints may fail if agents aren't configured properly
Don't use too many iterations — performance evals with high num_iterations can be slow and expensive
Don't ignore eval data — eval_data contains the actual evaluation results and metrics

agentos-api-evals

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

agentos-api-evals

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

AgentOS Evals API

Prerequisites

Default: Use the CLI Script

List all evaluation runs

Run an accuracy evaluation

Run a performance evaluation

Get details for a specific eval

Filter evals by agent

Use a different server

Full CLI reference

When to Write Custom Python

API Endpoints

Evaluation Types

Custom Python Examples

List with Advanced Filtering

Chained Evaluation Workflow

Anti-Patterns

Further Reading

Similar Skills

AgentOS Evals API

Prerequisites

Default: Use the CLI Script

List all evaluation runs

Run an accuracy evaluation

Run a performance evaluation

Get details for a specific eval

Filter evals by agent

Use a different server

Full CLI reference

When to Write Custom Python

API Endpoints

Evaluation Types

Custom Python Examples

List with Advanced Filtering

Chained Evaluation Workflow

Anti-Patterns

Further Reading

Similar Skills