Skill

agent-ops

Evaluates and monitors AI agents with Opik observability. Covers architecture patterns, tracing, evaluation metrics, and production monitoring for reliable agents.

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/opik:agent-ops

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

This skill covers the agent lifecycle beyond basic tracing: architecture patterns, evaluation, metrics, and production monitoring. All examples use Opik for observability — for SDK details (tracing, integrations, span types), load the `opik` skill.

Supporting Files

references/agent-patterns.md
references/evaluation.md
references/production.md

SKILL.md

107 lines · ~1.1k tokens

Stats

Language: Go
Stars: 18
Forks: 3
Maintenance: Excellent
Last Commit: Jun 2, 2026

Agent Operations: Build, Evaluate, and Monitor AI Agents

The Agent Lifecycle

1. Instrument: Add Opik tracing to make your agent's behavior visible (see the `opik` skill)
2. Evaluate: Measure performance with datasets, metrics, and experiments
3. Monitor: Track quality, cost, and reliability in production
4. Optimize: Improve the agent using data from evaluation and production traces

Agent Architecture Patterns

Trace every component of your agent with appropriate span types:

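import opik

# search_web and llm_call are placeholders for your own tool and model clients.

@opik.track(name="research_agent")
def agent(query: str) -> str:
    plan = plan_action(query)          # general span
    results = execute_tool(plan)       # tool span
    return generate_response(results)  # llm span

@opik.track(type="general")
def plan_action(query: str) -> dict:
    # Placeholder planner: decide which action to take for the query.
    return {"query": query}

@opik.track(type="tool")
def execute_tool(action: dict) -> str:
    return search_web(action["query"])

@opik.track(type="llm")
def generate_response(context: str) -> str:
    return llm_call(context)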

What to Trace

Component	Span Type	Key Data
Planning	`general`	Reasoning steps, decisions
Tool calls	`tool`	Tool name, parameters, results
LLM calls	`llm`	Prompt, response, tokens
Retrieval	`tool`	Query, documents
Validation	`guardrail`	Check results, pass/fail

Evaluation

Evaluate agents at multiple levels — end-to-end and per-component:

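import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination, AgentTaskCompletion

# "agent-eval" is a placeholder dataset name; each item is assumed to have an "input" key.
dataset = opik.Opik().get_or_create_dataset(name="agent-eval")

results = evaluate(
    experiment_name="agent-v2",
    dataset=dataset,
    # agent is the traced entry point defined under Agent Architecture Patterns.
    task=lambda item: {"output": agent(item["input"])},
    scoring_metrics=[
        AnswerRelevance(),
        Hallucination(),
        AgentTaskCompletion(),
    ],
)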

Built-in Agent Metrics

Metric	What It Measures
`AgentTaskCompletion`	Did the agent fulfill its task?
`AgentToolCorrectness`	Were tools used correctly?
`TrajectoryAccuracy`	Did actions match expected sequence?
`AnswerRelevance`	Does the answer address the question?
`Hallucination`	Are there unsupported claims?
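
To make "did actions match expected sequence?" concrete, here is a minimal pure-Python sketch of an in-order trajectory comparison. It is only an illustration of the idea and not how Opik's `TrajectoryAccuracy` is implemented; the function name `trajectory_match` and the step names are hypothetical.

```python
from typing import Sequence


def trajectory_match(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Return the fraction of expected steps that appear, in order, in the actual steps."""
    if not expected:
        # With nothing expected, only an empty trajectory counts as a match.
        return 1.0 if not actual else 0.0
    matched = 0
    for step in actual:
        # Advance through the expected sequence only when its next step shows up.
        if matched < len(expected) and step == expected[matched]:
            matched += 1
    return matched / len(expected)


# Hypothetical run: the agent planned and searched, but never generated a response.
score = trajectory_match(
    actual=["plan_action", "execute_tool", "summarize"],
    expected=["plan_action", "execute_tool", "generate_response"],
)
```

A score below 1.0 means at least one expected step was skipped or happened out of order, which is usually a cue to open the trace and inspect the tool spans.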

41 Total Built-in Metrics

Heuristic (Equals, Contains, BLEU, ROUGE, BERTScore, IsJson, etc.), LLM-as-Judge (AnswerRelevance, Hallucination, Usefulness, GEval, etc.), RAG (ContextPrecision, ContextRecall, Faithfulness), and conversation metrics. See references/evaluation.md for the full list.
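
Heuristic metrics are deterministic checks that need no model call. The sketch below shows roughly what checks in the spirit of `Contains` and `IsJson` compute; these standalone functions are illustrative assumptions, not the Opik classes, whose constructors and return types may differ.

```python
import json


def contains(output: str, reference: str, case_sensitive: bool = False) -> float:
    """Score 1.0 if the reference string occurs in the output, else 0.0."""
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    return 1.0 if reference in output else 0.0


def is_json(output: str) -> float:
    """Score 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return 0.0
    return 1.0
```

Because checks like these are cheap and repeatable, they suit regression tests on structured outputs, with LLM-as-Judge metrics reserved for open-ended quality.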

Production Monitoring

Dashboards — Visualize quality, cost, latency, and error trends
Online evaluation — Automatically score production traces with LLM-as-Judge
Alerts — Get notified when metrics deviate (quality drops, cost spikes, error rates)
Guardrails — PII detection, topic validation, custom safety checks
Opik Assist — AI-powered root cause analysis for failed traces
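
As an illustration of alerting on metric deviation, the following sketch flags a quality drop against a historical baseline. It is plain Python under assumed inputs (a list of past scores and a fixed `max_drop` threshold); Opik's own alert rules are configured in the product and are not shown here.

```python
from statistics import mean
from typing import Sequence


def should_alert(history: Sequence[float], latest: float, max_drop: float = 0.1) -> bool:
    """Return True when the latest score is more than max_drop below the historical mean."""
    if not history:
        return False  # no baseline yet, so there is nothing to compare against
    return (mean(history) - latest) > max_drop


# Hypothetical daily averages of an online evaluation score.
baseline = [0.90, 0.92, 0.88]
quality_dropped = should_alert(baseline, latest=0.70)
```

The same comparison works for cost or error rate by flipping the sign so that increases, not drops, trigger the alert.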

Common Anti-Patterns

Category	Anti-Pattern
Reliability	Unbounded loops, retry storms, silent failures
Security	Prompt injection, privilege escalation, data leakage
Observability	Late tracing (missing input), orphaned spans
Tools	Tool loops, hallucinated tools, parameter errors
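
The reliability anti-patterns above (unbounded loops, retry storms, silent failures) can be guarded against with explicit limits. This is a minimal sketch; `run_agent_loop`, `call_with_retry`, and `StepLimitExceeded` are hypothetical names, and the limits are arbitrary defaults.

```python
import time
from typing import Callable, Optional


class StepLimitExceeded(RuntimeError):
    """Raised when the agent loop ends without producing a result."""


def call_with_retry(fn: Callable[[], str], max_attempts: int = 3, base_delay: float = 0.5) -> str:
    """Call fn with a bounded number of attempts and exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Back off between attempts so retries do not pile up.
                time.sleep(base_delay * (2 ** attempt))
    # Surface the failure instead of swallowing it.
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


def run_agent_loop(step: Callable[[int], Optional[str]], max_steps: int = 5) -> str:
    """Run step until it returns a result, stopping after max_steps iterations."""
    for i in range(max_steps):
        result = step(i)
        if result is not None:
            return result
    raise StepLimitExceeded(f"no result after {max_steps} steps")
```

Raising on exhaustion, instead of returning an empty string, keeps the failure visible in the trace.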

Detailed References

Topic	Reference File
Agent architecture, reliability, security patterns	`references/agent-patterns.md`
Evaluation datasets, experiments, all 41 metrics	`references/evaluation.md`
Production dashboards, alerts, guardrails, cost tracking	`references/production.md`
