From agentic-usability
Displays a terminal scorecard of benchmark results with pass rates, scores by difficulty, and per-test breakdowns. Use when the user asks about benchmark results, scores, or SDK performance.
Slash command
/agentic-usability:report [project-directory] [--json] [--run runId]
Display the benchmark scorecard for the pipeline.
agentic-usability report -p $ARGUMENTS
- --json: Output raw structured JSON instead of the colored table
- --run <runId>: Show results for a specific run (default: latest)

results/<runId>/
  report.json                  # Aggregate scorecard
  <targetName>/<testId>/
    judge.json                 # Per-test judge scores
    generated-solution.json    # Agent's solution
    agent-notes.md             # Agent's working notes
| Dimension | Range | What it measures |
|---|---|---|
| apiDiscovery | 0-100 | Found correct SDK endpoints/methods? |
| callCorrectness | 0-100 | API calls constructed correctly? |
| completeness | 0-100 | All requirements handled? |
| functionalCorrectness | 0-100 | Code runs and produces correct output? |
| overallVerdict | boolean | Solution works? (pass/fail) |
The report aggregates these across all test cases and breaks them down by difficulty (easy/medium/hard).
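As a rough illustration of that aggregation, the sketch below walks one run directory, reads each judge.json, and averages the four 0-100 dimensions alongside the pass rate. It assumes each judge.json is a flat JSON object keyed by the dimension names in the table; the real file may nest or name them differently. The helper name `summarize_run` is hypothetical. The difficulty breakdown is left out because nothing here says where each test's difficulty is recorded.

```python
import json
from pathlib import Path
from statistics import mean

# Dimension names come from the table above.
# Assumption: judge.json stores them as top-level keys.
DIMENSIONS = [
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
]


def summarize_run(run_dir: Path) -> dict:
    """Average judge scores found at <run_dir>/<targetName>/<testId>/judge.json."""
    paths = sorted(run_dir.glob("*/*/judge.json"))
    judges = [json.loads(p.read_text()) for p in paths]
    if not judges:
        return {"tests": 0}

    summary = {"tests": len(judges)}
    for dim in DIMENSIONS:
        summary[dim] = mean(j[dim] for j in judges)

    # overallVerdict is a boolean, so the pass rate is the share of true values.
    passed = sum(1 for j in judges if j["overallVerdict"])
    summary["passRate"] = passed / len(judges)
    return summary
```

A call such as `summarize_run(Path("results") / "run-2026-04-25T10-30-00-000Z")` would return one dictionary per run; the report.json written by the tool remains the authoritative aggregate.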
Runs are stored as subdirectories in results/ containing run.json:
{ "id": "run-2026-04-25T10-30-00-000Z", "createdAt": "...", "targets": [...], "testCount": 15, "label": "..." }
To list all runs, look for results/*/run.json files.
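A minimal sketch of that lookup follows, assuming every run id embeds a timestamp in the same format as the example (`run-2026-04-25T10-30-00-000Z`), so that sorting ids as strings is also chronological. The function name `list_runs` is hypothetical and not part of the tool.

```python
import json
from pathlib import Path


def list_runs(results_dir: Path) -> list[dict]:
    """Load every <results_dir>/*/run.json, newest run first."""
    runs = [json.loads(p.read_text()) for p in results_dir.glob("*/run.json")]
    # Assumption: ids share one timestamp format, so string order matches time order.
    return sorted(runs, key=lambda run: run["id"], reverse=True)
```

Under that assumption, the first element of `list_runs(Path("results"))` is the latest run, which is the one the report shows when --run is omitted.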
Present the results to the user. If they want deeper analysis, suggest using the insights skill.
For detailed file inventory, see pipeline-guide.md.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability