From agentic-usability
Compares reference and generated solutions via an LLM judge, scoring on API discovery, call correctness, completeness, and functional correctness. Automated evaluation for AI-generated code.
How this skill is triggered — by the user, by Claude, or both
Slash command
/agentic-usability:judge [project-directory] [--tests TC-001,TC-002] [--run runId]
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.
echo "Arguments: $ARGUMENTS"
- --tests <ids>: Comma-separated test case IDs to judge
- --run <runId>: Target a specific run (default: latest)

Each test case receives scores on:
| Dimension | Range | What it measures |
|---|---|---|
| apiDiscovery | 0-100 | Did the agent find the correct SDK endpoints/methods? |
| callCorrectness | 0-100 | Are API calls constructed correctly (params, headers, body)? |
| completeness | 0-100 | Does the solution handle all requirements? |
| functionalCorrectness | 0-100 | Does the code run and produce correct output? |
| overallVerdict | boolean | Does the solution actually work? |
| notes | string | Brief explanation of scoring decisions |
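The ranges and types in the table can be checked mechanically before a score record is trusted. The following is a minimal sketch, not part of the plugin: `validate_scores` is a hypothetical helper, and it assumes the four 0-100 dimensions are plain JSON numbers.

```python
from typing import Any

# The four 0-100 dimensions named in the table above.
SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def validate_scores(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a score record (empty if none).

    Hypothetical helper: assumes the scores are JSON numbers.
    """
    problems = []
    for field in SCORE_FIELDS:
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
        elif not 0 <= value <= 100:
            problems.append(f"{field}: {value} is outside 0-100")
    if not isinstance(record.get("overallVerdict"), bool):
        problems.append("overallVerdict: expected a boolean")
    if not isinstance(record.get("notes"), str):
        problems.append("notes: expected a string")
    return problems
```

An empty list means the record matches the table; each entry otherwise names the offending field.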
Each result is written to results/<runId>/<target>/<testId>/judge.json:
{
"testId": "TC-001",
"target": "node-20",
"apiDiscovery": 85,
"callCorrectness": 90,
"completeness": 75,
"functionalCorrectness": 80,
"overallVerdict": true,
"notes": "Found correct APIs, minor parameter issue in error handling path"
}
If the executor produced no solution, the judge writes an all-zero score:
{ "apiDiscovery": 0, "callCorrectness": 0, "completeness": 0, "functionalCorrectness": 0, "overallVerdict": false, "notes": "No solution produced (DNF)" }
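Because every judged test case lands in its own `judge.json`, a run can be summarized by walking the results directory. The sketch below is an illustration under stated assumptions, not the plugin's own reporting: `summarize_run` and `is_dnf` are hypothetical names, and treating "all four scores zero and a false verdict" as DNF is a heuristic, since a genuinely judged solution could in principle score the same.

```python
import json
from collections import Counter
from pathlib import Path

SCORE_FIELDS = (
    "apiDiscovery",
    "callCorrectness",
    "completeness",
    "functionalCorrectness",
)


def is_dnf(record: dict) -> bool:
    """Heuristic (assumption): all-zero scores plus a false verdict means DNF."""
    return record.get("overallVerdict") is False and all(
        record.get(field) == 0 for field in SCORE_FIELDS
    )


def summarize_run(run_dir: str) -> dict[str, Counter]:
    """Tally pass / fail / dnf per target for one results/<runId> directory."""
    summary: dict[str, Counter] = {}
    # Layout from the text above: <runId>/<target>/<testId>/judge.json
    for path in sorted(Path(run_dir).glob("*/*/judge.json")):
        target = path.parent.parent.name
        record = json.loads(path.read_text(encoding="utf-8"))
        if is_dnf(record):
            outcome = "dnf"
        elif record.get("overallVerdict") is True:
            outcome = "pass"
        else:
            outcome = "fail"
        summary.setdefault(target, Counter())[outcome] += 1
    return summary
```

Calling `summarize_run("results/<runId>")` with a real run ID returns one `Counter` per target, for example a count of passes and DNFs for `node-20`.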
| File | Description |
|---|---|
| judge.json | Full scoring result |
| judge-cmd.log | Judge command executed |
| judge-output.log | Raw judge stdout/stderr |
| judge-session.jsonl | Judge conversation log (if available) |
| judge-egress.log.json | Judge network traffic |
| judge-error.log | Error (only on failure) |
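When debugging a single test case, it can help to see which of the files in the table are actually present. This sketch assumes, without confirmation from the source, that all of them sit next to `judge.json` in the same `<testId>` directory; `inventory` and `judge_failed` are hypothetical helper names.

```python
from pathlib import Path

# File names taken from the table above.
JUDGE_ARTIFACTS = (
    "judge.json",
    "judge-cmd.log",
    "judge-output.log",
    "judge-session.jsonl",
    "judge-egress.log.json",
    "judge-error.log",
)


def inventory(test_dir: str) -> dict[str, bool]:
    """Map each expected artifact name to whether it exists in test_dir."""
    base = Path(test_dir)
    return {name: (base / name).is_file() for name in JUDGE_ARTIFACTS}


def judge_failed(test_dir: str) -> bool:
    """judge-error.log is listed as written only on failure, so its presence
    is used here as the failure signal (an assumption about the layout)."""
    return (Path(test_dir) / "judge-error.log").is_file()
```

A missing `judge-session.jsonl` is not necessarily a problem, since the table marks it as available only in some cases.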
Tracked in results/<runId>/pipeline-state.json:
- completed.judge["<target>"] lists judged test IDs

Run agentic-usability judge -p $ARGUMENTS and report the results.
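The `completed.judge` entry in `pipeline-state.json` makes it possible to work out which test cases still need judging for a target. The sketch below assumes the dotted path maps to nested JSON objects (`{"completed": {"judge": {"<target>": [...]}}}`); that shape is inferred from the notation above, and `pending_tests` is a hypothetical helper rather than anything the CLI exposes.

```python
import json
from pathlib import Path


def load_state(run_dir: str) -> dict:
    """Read results/<runId>/pipeline-state.json for the given run directory."""
    path = Path(run_dir) / "pipeline-state.json"
    return json.loads(path.read_text(encoding="utf-8"))


def pending_tests(state: dict, target: str, all_test_ids: list[str]) -> list[str]:
    """Return the test IDs not yet listed under completed.judge[target].

    Assumes a nested layout; missing keys are treated as "nothing judged yet".
    """
    judged = set(state.get("completed", {}).get("judge", {}).get(target, []))
    # Preserve the caller's ordering of test IDs.
    return [test_id for test_id in all_test_ids if test_id not in judged]


# Example with an in-memory state (values are illustrative only).
example_state = {"completed": {"judge": {"node-20": ["TC-001"]}}}
remaining = pending_tests(example_state, "node-20", ["TC-001", "TC-002"])
```

Here `remaining` holds only `TC-002`, which could then be passed to `--tests` to judge just the outstanding case.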
For detailed internals, see pipeline-guide.md.
npx claudepluginhub pspdfkit-labs/agentic-usability --plugin agentic-usability