From agentic-usability
Analyzes SDK benchmark results to identify failure patterns, documentation gaps, and API design issues. Use when reviewing evaluation runs or improving SDK usability.
Slash command
/agentic-usability:insights [project-directory]
You are acting as an SDK usability analyst. Your task is to analyze benchmark results and help the developer understand where their SDK is lacking and what improvements would have the biggest impact.
Results are at results/<runId>/<target>/<testId>/:
| File | Content |
|---|---|
| judge.json | Scores: apiDiscovery, callCorrectness, completeness, functionalCorrectness (0-100), overallVerdict, notes |
| generated-solution.json | Agent's solution [{path, content}] |
| agent-notes.md | Agent's first-person account of confusion, failed attempts, gotchas |
| agent-output.log | Raw agent stdout/stderr |
| agent-session.jsonl | Full agent conversation log |
| agent-egress.log.json | Network traffic (what URLs the agent accessed) |
| judge-session.jsonl | Judge conversation log |
| judge-egress.log.json | Judge network traffic |
| workspace-snapshot.tar.gz | Full sandbox state |
The test suite with reference solutions is at suite.json in the project root.
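As a rough illustration of how the results layout above could be consumed, the sketch below walks results/<runId>/<target>/<testId>/judge.json and aggregates the four scores. It assumes that judge.json is a flat JSON object whose keys match the names in the table and that overallVerdict is a boolean; the actual schema may differ. The helper names (load_judgements, summarize, passed_despite_low_discovery) and the threshold of 50 are hypothetical and are not part of the agentic-usability tool.

```python
import json
from pathlib import Path
from statistics import mean

# Assumption: these score names appear as top-level numeric keys in judge.json.
SCORE_KEYS = ("apiDiscovery", "callCorrectness", "completeness", "functionalCorrectness")


def load_judgements(results_root):
    """Yield (run_id, target, test_id, judge_dict) for every judge.json found."""
    for judge_path in sorted(Path(results_root).glob("*/*/*/judge.json")):
        # Path layout: <results_root>/<runId>/<target>/<testId>/judge.json
        run_id, target, test_id = judge_path.parts[-4:-1]
        with judge_path.open(encoding="utf-8") as fh:
            yield run_id, target, test_id, json.load(fh)


def summarize(results_root):
    """Return the test count, pass count, and mean of each score."""
    rows = list(load_judgements(results_root))
    averages = {}
    if rows:
        for key in SCORE_KEYS:
            averages[key] = mean(judge[key] for _, _, _, judge in rows)
    passed = sum(1 for _, _, _, judge in rows if judge.get("overallVerdict") is True)
    return {"tests": len(rows), "passed": passed, "averages": averages}


def passed_despite_low_discovery(results_root, threshold=50):
    """List test IDs that passed overall but scored below threshold on apiDiscovery.

    The threshold is an arbitrary illustrative cut-off, not a value from the tool.
    """
    return [
        test_id
        for _, _, test_id, judge in load_judgements(results_root)
        if judge.get("overallVerdict") is True and judge.get("apiDiscovery", 0) < threshold
    ]
```

Calling summarize("results") gives a quick overview across all runs, while passed_despite_low_discovery("results") surfaces tests where the agent reached a working solution without finding the intended API, which may point to discoverability gaps rather than correctness problems.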
overallVerdict can be true even with low apiDiscovery (different but working approach).
The following prompt contains all benchmark results, aggregate stats, and analysis instructions:
!agentic-usability insights --prompt-only -p $ARGUMENTS
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability