From evaluation
Defines task success metrics like completion rate, time to completion, and intervention rate to evaluate if AI helps users achieve goals beyond output quality.
How this skill is triggered — by the user, by Claude, or both
Slash command
/evaluation:task-success-metricsThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Output quality doesn't guarantee task success. The AI might produce a beautiful response that doesn't actually help the user do what they came to do. Task success metrics measure the end-to-end outcome.
Output quality doesn't guarantee task success. The AI might produce a beautiful response that doesn't actually help the user do what they came to do. Task success metrics measure the end-to-end outcome.
For each user task, define:
These can diverge:
npx claudepluginhub owl-listener/ai-design-skills --plugin evaluationTracks AI product quality over time, detecting drift, degradation, and improvements using golden test sets, automated evals, dashboards, and alerts. Useful for AI reliability maintenance.
Monitors AI agent health across quality, cost, performance, and errors using Amplitude Agent Analytics. Proactive health reports and drill-down into failing sessions.
Audits pre-launch AI features across 6 dimensions—model selection, data quality, cost, monitoring, failure UX, optimization—grading readiness and blocking shipment of broken products.