From opik
Evaluates and monitors AI agents with Opik observability. Covers architecture patterns, tracing, evaluation metrics, and production monitoring for reliable agents.
Slash command: `/opik:agent-ops`
This skill covers the agent lifecycle beyond basic tracing: architecture patterns, evaluation, metrics, and production monitoring. All examples use Opik for observability — for SDK details (tracing, integrations, span types), load the `opik` skill.
Trace every component of your agent with appropriate span types:
```python
import opik


# plan_action, search_web, and llm_call are application-defined helpers (not shown).
@opik.track(name="research_agent")
def agent(query: str) -> str:
    plan = plan_action(query)          # general span
    results = execute_tool(plan)       # tool span
    return generate_response(results)  # llm span


@opik.track(type="tool")
def execute_tool(action: dict) -> str:
    return search_web(action["query"])


@opik.track(type="llm")
def generate_response(context: str) -> str:
    return llm_call(context)
```
| Component | Span Type | Key Data |
|---|---|---|
| Planning | general | Reasoning steps, decisions |
| Tool calls | tool | Tool name, parameters, results |
| LLM calls | llm | Prompt, response, tokens |
| Retrieval | tool | Query, documents |
| Validation | guardrail | Check results, pass/fail |
Evaluate agents at multiple levels — end-to-end and per-component:
```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination, AgentTaskCompletion

# `dataset` is an Opik dataset created or loaded beforehand; `agent` is the tracked agent function.
results = evaluate(
    experiment_name="agent-v2",
    dataset=dataset,
    task=lambda item: {"output": agent(item["input"])},
    scoring_metrics=[
        AnswerRelevance(),
        Hallucination(),
        AgentTaskCompletion(),
    ],
)
```
| Metric | What It Measures |
|---|---|
| AgentTaskCompletion | Did the agent fulfill its task? |
| AgentToolCorrectness | Were tools used correctly? |
| TrajectoryAccuracy | Did actions match expected sequence? |
| AnswerRelevance | Does the answer address the question? |
| Hallucination | Are there unsupported claims? |
Other metric families are also available: heuristic metrics (Equals, Contains, BLEU, ROUGE, BERTScore, IsJson, etc.), LLM-as-Judge metrics (AnswerRelevance, Hallucination, Usefulness, GEval, etc.), RAG metrics (ContextPrecision, ContextRecall, Faithfulness), and conversation metrics. See references/evaluation.md for the full list.
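As a rough illustration of what a trajectory-style check looks at, the sketch below scores an observed tool-call sequence against an expected one in plain Python. It is not Opik's `TrajectoryAccuracy` implementation; the function names `exact_match`, `lcs_length`, and `ordered_recall`, and the sample tool names, are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def exact_match(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Return 1.0 only when both sequences are identical, else 0.0."""
    return 1.0 if list(actual) == list(expected) else 0.0


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordered_recall(actual: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of expected steps that the agent performed in the expected order.

    Extra steps in `actual` are tolerated; missing or reordered steps lower the score.
    """
    if not expected:
        return 1.0
    return lcs_length(actual, expected) / len(expected)


expected = ["search_web", "read_page", "summarize"]       # hypothetical tool names
with_retry = ["search_web", "search_web", "read_page", "summarize"]
reordered = ["read_page", "search_web", "summarize"]

print(exact_match(with_retry, expected), ordered_recall(with_retry, expected))
print(exact_match(reordered, expected), ordered_recall(reordered, expected))
```

A strict exact match penalizes a harmless retry, while the order-aware score does not. Which behavior is appropriate depends on whether extra steps count as failures for your agent.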
| Category | Anti-Pattern |
|---|---|
| Reliability | Unbounded loops, retry storms, silent failures |
| Security | Prompt injection, privilege escalation, data leakage |
| Observability | Late tracing (missing input), orphaned spans |
| Tools | Tool loops, hallucinated tools, parameter errors |
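Several of the reliability and tool anti-patterns above (unbounded loops, tool loops, silent failures) can be guarded against with a step budget and a repeat counter. The following is a minimal sketch under stated assumptions, not code from the skill's reference files: the `step` callable, its return shape, `AgentLoopError`, and the default limits are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

# Hypothetical step function: takes the query and returns
# (tool_name, tool_args, final_answer), where final_answer is None until done.
Step = Callable[[str], Tuple[str, str, Optional[str]]]


class AgentLoopError(RuntimeError):
    """Raised explicitly so a stuck agent does not fail silently."""


def run_bounded(step: Step, query: str, max_steps: int = 8, max_repeats: int = 2) -> str:
    """Run the agent loop with a hard step budget and tool-loop detection."""
    seen: Counter = Counter()
    for _ in range(max_steps):
        tool, args, answer = step(query)
        if answer is not None:
            return answer
        # Count identical (tool, args) calls; too many repeats indicates a tool loop.
        seen[(tool, args)] += 1
        if seen[(tool, args)] > max_repeats:
            raise AgentLoopError(
                f"tool loop: {tool}({args!r}) repeated {seen[(tool, args)]} times"
            )
    raise AgentLoopError(f"no answer after {max_steps} steps")
```

Raising a dedicated exception keeps the failure visible to callers, and to any tracing wrapped around `run_bounded`, instead of returning an empty or partial answer.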
| Topic | Reference File |
|---|---|
| Agent architecture, reliability, security patterns | references/agent-patterns.md |
| Evaluation datasets, experiments, all 41 metrics | references/evaluation.md |
| Production dashboards, alerts, guardrails, cost tracking | references/production.md |
npx claudepluginhub comet-ml/opik-claude-code-plugin