From evaluation
Tracks AI product quality over time, detecting drift, degradation, and improvements using golden test sets, automated evals, dashboards, and alerts. Useful for AI reliability maintenance.
Slash command
/evaluation:longitudinal-measurement
AI products change over time — models get updated, usage patterns shift, and quality can drift without anyone noticing. Longitudinal measurement is how you track quality across time and catch degradation before users do.
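As a rough illustration of what tracking quality across time can look like, the sketch below compares two scored passes over a golden test set and flags a drop beyond a tolerance. Everything in it is an assumption made for illustration: the `EvalRun` structure, the `detect_drift` function, the 0.05 tolerance, and the sample scores are hypothetical and not part of this skill.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalRun:
    """One scored pass over a golden test set (hypothetical structure)."""
    label: str           # identifier for the run, e.g. a model version
    scores: list[float]  # per-example scores in [0, 1]


def detect_drift(baseline: EvalRun, current: EvalRun, tolerance: float = 0.05) -> dict:
    """Compare mean scores and classify the change relative to `tolerance`."""
    base_mean = mean(baseline.scores)
    cur_mean = mean(current.scores)
    delta = cur_mean - base_mean
    if delta < -tolerance:
        status = "degraded"
    elif delta > tolerance:
        status = "improved"
    else:
        status = "stable"
    return {"baseline": base_mean, "current": cur_mean, "delta": delta, "status": status}


# Made-up scores, purely illustrative.
baseline = EvalRun("run-1", [0.9, 0.8, 1.0, 0.9])
current = EvalRun("run-2", [0.7, 0.8, 0.8, 0.7])
report = detect_drift(baseline, current)
print(report["status"])
```

A comparison of means is the simplest possible check; a real setup would likely also track per-category scores and run-to-run variance before raising an alert.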
When measurements show drift:
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluation