Skill

judge

Compares reference and generated solutions via an LLM judge, scoring on API discovery, call correctness, completeness, and functional correctness. Automated evaluation for AI-generated code.

testing

api-development

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agentic-usability:judge [project-directory] [--tests TC-001,TC-002] [--run runId]

User invocable

Model invocation disabled

Inline context

Default effort

Uses dynamic context injection — preprocesses shell commands at runtime

Argument hint[project-directory] [--tests TC-001,TC-002] [--run runId]

Tool Access

This skill is limited to the following tools:

Bash(agentic-usability *)ReadGlob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.

SKILL.md

86 lines · ~705 tokens

Stats

LanguageTypeScript

Stars15

MaintenanceExcellent

Last CommitJun 11, 2026

Actions

View Source View Plugin View on GitHub View README

Judge Solutions

Run the judge stage. For each test case and target, an LLM judge compares the reference solution against the generated solution and produces scores.

echo "Arguments: $ARGUMENTS"

Options

--tests <ids>: Comma-separated test case IDs to judge
--run <runId>: Target a specific run (default: latest)

Scoring Dimensions

Each test case receives scores on:

Dimension	Range	What it measures
`apiDiscovery`	0-100	Did the agent find the correct SDK endpoints/methods?
`callCorrectness`	0-100	Are API calls constructed correctly (params, headers, body)?
`completeness`	0-100	Does the solution handle all requirements?
`functionalCorrectness`	0-100	Does the code run and produce correct output?
`overallVerdict`	boolean	Does the solution actually work?
`notes`	string	Brief explanation of scoring decisions

Score Bands

0-20: Fundamentally wrong
21-40: Major issues, partially correct
41-60: Mostly correct with notable mistakes
61-80: Correct with minor issues
81-100: Excellent, matches reference

Judge Output

Written to results/<runId>/<target>/<testId>/judge.json:

{
  "testId": "TC-001",
  "target": "node-20",
  "apiDiscovery": 85,
  "callCorrectness": 90,
  "completeness": 75,
  "functionalCorrectness": 80,
  "overallVerdict": true,
  "notes": "Found correct APIs, minor parameter issue in error handling path"
}

DNF (Did Not Finish)

If the executor produced no solution, the judge writes an all-zero score:

{ "apiDiscovery": 0, "callCorrectness": 0, "completeness": 0, "functionalCorrectness": 0, "overallVerdict": false, "notes": "No solution produced (DNF)" }

Per-Test Judge Files

File	Description
`judge.json`	Full scoring result
`judge-cmd.log`	Judge command executed
`judge-output.log`	Raw judge stdout/stderr
`judge-session.jsonl`	Judge conversation log (if available)
`judge-egress.log.json`	Judge network traffic
`judge-error.log`	Error (only on failure)

Progress Tracking

Tracked in results/<runId>/pipeline-state.json:

completed.judge["<target>"] lists judged test IDs
State saved after each test — safe to interrupt and resume

Run agentic-usability judge -p $ARGUMENTS and report the results.

For detailed internals, see pipeline-guide.md.

judge

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

judge

Popularity

Invocation

Tool Access

Context Preview

SKILL.md

Judge Solutions

Options

Scoring Dimensions

Score Bands

Judge Output

DNF (Did Not Finish)

Per-Test Judge Files

Progress Tracking

Similar Skills

Judge Solutions

Options

Scoring Dimensions

Score Bands

Judge Output

DNF (Did Not Finish)

Per-Test Judge Files

Progress Tracking

Similar Skills