From agentops
Grades agent or model output against holdout scenarios with acceptance vectors and satisfaction scoring. Manages holdout evals, runtime comparisons, and verdict records.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agentops:eval-outcomesThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Author and manage holdout scenarios with the `ao` CLI: `ao scenario add "<title>"`
Author and manage holdout scenarios with the ao CLI: ao scenario add "<title>"
creates a scenario in .agents/holdout/ (ID s-YYYY-MM-DD-NNN, acceptance
vectors, 0.8 default satisfaction threshold); ao scenario validate checks the
holdout set's schema and link graph. Linked scenarios feed directive fitness via
ao goals scenarios (see the /goals skill and docs/adr/ADR-0003).
.agents/holdout/ for behavioral validation.This skill encodes independent-verdict machinery and now lives with the outer
gate product. Canonical: ~/dev/mt-olympus/.claude/skills/eval-outcomes/SKILL.md —
read and follow that file. This stub preserves fleet routing until the
using-agentops catalog closer updates the registry (skill-prune Lane A,
evidence/skill-prune-recon.md).
eval-outcomes is the fold target for the retired standalone scenario skill
(skill-prune phase 2). Fire this skill for its use-cases:
.agents/holdout/*.json so implementing agents cannot see them
during development. When asked to author, manage, or score holdout scenarios,
fire this skill.npx claudepluginhub boshu2/agentops --plugin agentopsExecutes skill evaluations against test cases, scores outputs with judges, and reports results. Use when testing a skill, benchmarking, detecting regressions, or verifying changes.
Runs evaluation scenarios to benchmark agent performance via reflexion loops, validates success criteria, records metrics, generates reports, and proposes new evals from logs.
Tests skills for correct agent behavior via EVAL.md scenarios after modifications, periodic reviews, or model upgrades. Supports manual, scout, and automated bash-script evals.