From Leerie
Evaluates a batch of leerie LLM call captures against a 3-dimensional rubric (schema adherence, factual grounding, hallucination-freeness) and writes verdict JSON for worker quality measurement.
How this skill is triggered — by the user, by Claude, or both
Slash command
/leerie:judge-llm-batch <path-to-calls.ndjson> --call-type <type> [--run-id <run-id>] [--out <verdict-path>]<path-to-calls.ndjson> --call-type <type> [--run-id <run-id>] [--out <verdict-path>]This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
<objective>
The calls.ndjson file lives at .leerie/runs/<run-id>/calls.ndjson.
Output shape:
{
"call_type": "<from filter>",
"run_id": "<from file>",
"judged_at": "<ISO-8601>",
"judge_model": "claude-sonnet-4-6",
"verdicts": [
{
"call_id": "<UUID from the capture>",
"schema_ok": true,
"schema_rationale": "<one sentence>",
"factually_grounded": true,
"factual_rationale": "<one sentence>",
"hallucination_free": true,
"hallucination_rationale": "<one sentence>",
"pass": true,
"worst_offender": "<optional — 1-line quote when any dimension failed>"
}
],
"aggregate": {
"n": 0,
"schema_pass": 0,
"factual_pass": 0,
"hallucination_free_pass": 0,
"overall_pass": 0
}
}
pass is true only when all three dimensions are true.
<execution_context>
Arguments parsed from $ARGUMENTS:
calls.ndjson (required). Can also be a
.leerie/runs/<run-id>/ directory — the skill will find
calls.ndjson inside it.--call-type <name> (required): one of classifier, planner,
reconciler, implementer, integrator, conformer. Filters the
NDJSON to only lines with this call_type value.--run-id <id> (optional): if provided, resolves the path as
.leerie/runs/<run-id>/calls.ndjson relative to CWD.--out <path> (optional): explicit verdict output path; defaults to
<ndjson-dir>/judge-out/<call_type>-verdicts.json.The NDJSON line shape (from IMPLEMENTATION.md §10):
{
"call_id": "<UUID v4>",
"run_id": "<str>",
"call_type": "<str>",
"model": "<str>",
"system_prompt": "<str>",
"user_content": "<str>",
"response_content": "<str>",
"parsed_ok": true,
"input_tokens": 0,
"output_tokens": 0,
"latency_ms": 0,
"success": true,
"ts": "<ISO-8601>"
}
</execution_context>
Leerie records every `claude -p` worker invocation to a NDJSON telemetry file immediately after each call returns. The file is append-only and is valid NDJSON through the last complete line even under a hard kill.The judge skill operates post-run: it reads the archive, scores a batch, and writes verdicts. The llm-self-heal skill consumes these verdicts to propose prompt patches.
Each call_type maps to exactly one system-prompt source
(IMPLEMENTATION.md §10 call_type → prompt table):
| call_type | Prompt source |
|---|---|
| classifier | prompts/classifier.md |
| planner | prompts/planner.md |
| reconciler | prompts/reconciler.md |
| implementer | prompts/implementer.md |
| integrator | prompts/integrator.md |
| conformer | prompts/conformer.md |
The system_prompt field in the NDJSON capture is the actual verbatim
text injected — so the judge can always derive what the worker was asked
to do from the capture alone.
Resolve the input path from $ARGUMENTS. If the path is a directory,
append /calls.ndjson. Read line-by-line and parse each line as JSON.
Filter to lines where call_type matches the --call-type argument.
If zero lines match, emit: No captures for call_type=<name> in <path> and stop.
Before scoring everything, read ONE sample carefully:
system_prompt asking the worker to do?user_content provide?This calibration is load-bearing — without it you grade against a guessed rubric instead of the actual contract.
schema_ok)Question: Does response_content conform to the structured-output
schema the system_prompt requires?
Pass criteria:
parsed_ok=false automatically fail)call_typeFail criteria: Missing required fields, wrong types, enum violations,
extra undeclared fields, malformed JSON, parsed_ok=false.
factually_grounded)Question: Are the substantive claims in response_content grounded
in the user_content the worker received?
Pass criteria:
Fail criteria: Claims demonstrably contradicted by the input; counts that don't match what is visible; categorical errors.
hallucination_free)Question: Is the output free of fabricated content — references to
things absent from both user_content and system_prompt?
Pass criteria:
Fail criteria: References to files, functions, or concepts not in
the input. Include worst_offender with the fabricated phrase.
Counts:
n: total samples judgedschema_pass: count of schema_ok=truefactual_pass: count of factually_grounded=truehallucination_free_pass: count of hallucination_free=trueoverall_pass: count of pass=true (all three dimensions true)Resolve the output path (default: <ndjson-dir>/judge-out/<call_type>-verdicts.json).
Create parent directories if needed. Write pretty-printed JSON (2-space indent).
[<call_type>] judged n=<N> schema=<S>/<N> factual=<F>/<N> halluc_free=<H>/<N> pass=<P>/<N> → <out_path>
Then stop.
<safety_constraints>
claude -p callsGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub enricai/leerie --plugin leerie