From agentic-usability
Run the full evaluation pipeline (execute, judge, report) for SDK usability benchmarks. Supports resume, status checks, and labeling.
Slash command
/agentic-usability:eval [project-directory] [--resume] [--fresh] [--label name]
Run the complete benchmark pipeline: **execute → judge → report**.
echo "Arguments: $ARGUMENTS"
report.json
- --resume: Resume from the last checkpoint of an interrupted pipeline.
- --fresh: Only useful with --resume. Resets pipeline state so the run re-executes from scratch in the same run directory. Does NOT delete result files. Without --resume, a new run always starts fresh anyway.
- --label <name>: Human-readable label for this run.
- --run <runId>: Only used with --resume. Targets a specific run instead of auto-detecting the latest incomplete one.

Before running, you can check whether a pipeline is paused or interrupted by reading the pipeline state file:
Pipeline state location: <project>/results/<runId>/pipeline-state.json
{
  "stage": "execute",
  "startedAt": "2026-04-25T10:30:00.000Z",
  "testCases": 15,
  "completed": {
    "execute": { "node-20": ["TC-001", "TC-002"] },
    "judge": { "node-20": [] }
  }
}
How to check status:
- stage is "execute" or "judge" → pipeline is incomplete/paused
- stage is "report" → pipeline completed successfully
- Compare completed[stage][target].length with testCases to see progress
- No report.json in the run directory → pipeline didn't finish
- Runs are the directories under results/ containing run.json

Run manifest (results/<runId>/run.json):
{
  "id": "run-2026-04-25T10-30-00-000Z",
  "createdAt": "2026-04-25T10:30:00.000Z",
  "targets": ["node-20"],
  "testCount": 15,
  "label": "baseline v2"
}
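
The status rules above can be sketched as a small Python helper. This is an illustration only, not part of the plugin: the function names `summarize_state` and `read_state` are hypothetical, and the field names are taken from the example pipeline-state.json shown earlier.

```python
import json
from pathlib import Path

# Example state copied from the pipeline-state.json sample above.
EXAMPLE_STATE = {
    "stage": "execute",
    "startedAt": "2026-04-25T10:30:00.000Z",
    "testCases": 15,
    "completed": {
        "execute": {"node-20": ["TC-001", "TC-002"]},
        "judge": {"node-20": []},
    },
}


def summarize_state(state: dict) -> dict:
    """Apply the status rules to a parsed pipeline-state.json (hypothetical helper)."""
    stage = state["stage"]
    total = state["testCases"]
    # "report" means the pipeline finished; "execute" or "judge" means it is paused.
    finished = stage == "report"
    progress = {}
    if not finished:
        # Progress for the current stage: completed count vs. testCases, per target.
        for target, done in state["completed"].get(stage, {}).items():
            progress[target] = f"{len(done)}/{total}"
    return {"stage": stage, "finished": finished, "progress": progress}


def read_state(run_dir: str) -> dict:
    """Load <run_dir>/pipeline-state.json from disk."""
    path = Path(run_dir) / "pipeline-state.json"
    return json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(summarize_state(EXAMPLE_STATE))
```

For the sample state this reports the execute stage as unfinished with 2 of 15 test cases done for node-20. To inspect a real run, pass the result of `read_state("<project>/results/<runId>")` instead of the sample.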
When --resume is passed:
- Finds the latest incomplete run (stage !== "report"), or uses --run <id>
- Reads the completed map for each target to continue from the last checkpoint

Run agentic-usability eval -p $ARGUMENTS and monitor the output. If interrupted, suggest --resume to continue.
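
The run-selection step used by --resume might look roughly like the following Python sketch. It is an assumption about the behavior, not the plugin's actual code: `find_resumable_run` is a hypothetical name, and ordering runs by the `createdAt` field of run.json is one plausible reading of "latest".

```python
import json
from pathlib import Path
from typing import Optional


def find_resumable_run(results_dir: str, run_id: Optional[str] = None) -> Optional[str]:
    """Return the run ID to resume, or None if nothing qualifies (hypothetical helper)."""
    root = Path(results_dir)

    # An explicit run ID (as with --run) wins, provided its manifest exists.
    if run_id is not None:
        return run_id if (root / run_id / "run.json").is_file() else None

    candidates = []
    for manifest in root.glob("*/run.json"):
        state_path = manifest.parent / "pipeline-state.json"
        if not state_path.is_file():
            continue
        state = json.loads(state_path.read_text(encoding="utf-8"))
        # A run is incomplete while its stage has not reached "report".
        if state.get("stage") != "report":
            created = json.loads(manifest.read_text(encoding="utf-8")).get("createdAt", "")
            candidates.append((created, manifest.parent.name))

    # Assumption: createdAt values share one ISO 8601 format, so string order is time order.
    return max(candidates)[1] if candidates else None
```

Calling `find_resumable_run("<project>/results")` mirrors auto-detection, while passing a second argument mirrors --run <id>.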
For detailed pipeline internals, see pipeline-guide.md.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability