Skill

eval

Run the full evaluation pipeline (execute, judge, report) for SDK usability benchmarks. Supports resume, status checks, and labeling.

automation

testing

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/agentic-usability:eval [project-directory] [--resume] [--fresh] [--label name]

User invocable

Model invocation disabled

Inline context

Default effort

Uses dynamic context injection — preprocesses shell commands at runtime

Argument hint: [project-directory] [--resume] [--fresh] [--label name]

Tool Access

This skill is limited to the following tools:

Bash(agentic-usability *), Read, Glob

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run the complete benchmark pipeline: **execute → judge → report**.

SKILL.md

85 lines · ~802 tokens

Stats

Language: TypeScript

Stars: 15

Maintenance: Excellent

Last Commit: Jun 11, 2026

Run Full Evaluation Pipeline

Run the complete benchmark pipeline: execute → judge → report.

echo "Arguments: $ARGUMENTS"

Pipeline Stages

Execute: For each test case × target, spins up a sandboxed VM, has an AI agent solve the problem, and extracts the solution files.
Judge: For each test case × target, an LLM judge compares the generated solution to the reference solution and scores it.
Report: Aggregates all judge scores into a terminal scorecard and writes report.json.
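
To make the Report stage concrete, here is a minimal sketch of aggregating judge scores per target. The JudgeResult shape, the aggregateByTarget name, and the use of a plain mean are all assumptions for illustration; the judge output format and the tool's real aggregation rule are not documented here.

```typescript
// Hypothetical record shape: the real judge output format is not documented here.
interface JudgeResult {
  testCase: string;
  target: string;
  score: number;
}

interface TargetSummary {
  count: number; // number of judged test cases for this target
  mean: number;  // arithmetic mean of their scores (assumed aggregation rule)
}

// Groups results by target and computes a count and mean score for each.
function aggregateByTarget(results: JudgeResult[]): Map<string, TargetSummary> {
  const totals = new Map<string, { count: number; total: number }>();
  for (const r of results) {
    const entry = totals.get(r.target) ?? { count: 0, total: 0 };
    entry.count += 1;
    entry.total += r.score;
    totals.set(r.target, entry);
  }

  const summary = new Map<string, TargetSummary>();
  for (const [target, { count, total }] of totals) {
    summary.set(target, { count, mean: total / count });
  }
  return summary;
}
```

A scorecard could then be printed from the returned map, one row per target.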

Options

--resume: Resume from the last checkpoint of an interrupted pipeline
--fresh: Only useful with --resume. Resets pipeline state so the run re-executes from scratch in the same run directory. Does NOT delete result files. Without --resume, a new run always starts fresh anyway.
--label <name>: Human-readable label for this run
--run <runId>: Only used with --resume. Target a specific run instead of auto-detecting the latest incomplete one.
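
The flag interactions above can be sketched as a small check. This is an illustrative helper, not the CLI's actual argument parser; the EvalOptions type and the checkOptions name are hypothetical.

```typescript
// Hypothetical options object mirroring the flags listed above.
interface EvalOptions {
  resume?: boolean;
  fresh?: boolean;
  label?: string;
  run?: string;
}

// Returns a warning for each flag that, per the option descriptions,
// is only meaningful together with --resume.
function checkOptions(opts: EvalOptions): string[] {
  const warnings: string[] = [];
  if (opts.fresh && !opts.resume) {
    warnings.push("--fresh is only useful with --resume; a new run always starts fresh");
  }
  if (opts.run !== undefined && !opts.resume) {
    warnings.push("--run is only used with --resume");
  }
  return warnings;
}
```

For example, checkOptions({ fresh: true }) yields one warning, while checkOptions({ resume: true, fresh: true }) yields none.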

Detecting Pipeline Status

Before running, you can check if a pipeline is paused/interrupted by reading the pipeline state file:

Pipeline state location: <project>/results/<runId>/pipeline-state.json

{
  "stage": "execute",
  "startedAt": "2026-04-25T10:30:00.000Z",
  "testCases": 15,
  "completed": {
    "execute": { "node-20": ["TC-001", "TC-002"] },
    "judge": { "node-20": [] }
  }
}

How to check status:

stage is "execute" or "judge" → pipeline is incomplete/paused
stage is "report" → pipeline completed successfully
Compare completed[stage][target].length against testCases to see progress (in the example above, execute on node-20 is at 2 of 15)
No report.json in the run directory → pipeline didn't finish
List runs: look for subdirectories in results/ containing run.json
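
The status checks above can be sketched as a pure function over the parsed state file. The field names follow the pipeline-state.json example; the summarizeStatus name, the StatusSummary type, and the hasReport parameter (the caller checks whether report.json exists) are assumptions made for this sketch.

```typescript
type Stage = "execute" | "judge" | "report";

// Shape taken from the pipeline-state.json example.
interface PipelineState {
  stage: Stage;
  startedAt: string;
  testCases: number;
  completed: Partial<Record<"execute" | "judge", Record<string, string[]>>>;
}

interface StatusSummary {
  finished: boolean;
  stage: Stage;
  // Completed vs. total test cases per target, for the current stage only.
  progress: Record<string, { done: number; total: number }>;
}

// Combines the checks: stage "report" plus an existing report.json means finished;
// otherwise progress is reported for the stage the pipeline stopped in.
function summarizeStatus(state: PipelineState, hasReport: boolean): StatusSummary {
  const progress: Record<string, { done: number; total: number }> = {};
  const stage = state.stage;
  if (stage !== "report") {
    const perTarget = state.completed[stage] ?? {};
    for (const [target, ids] of Object.entries(perTarget)) {
      progress[target] = { done: ids.length, total: state.testCases };
    }
  }
  return { finished: stage === "report" && hasReport, stage, progress };
}
```

Applied to the example state with hasReport set to false, this reports finished: false, stage "execute", and node-20 at 2 of 15.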

Run manifest (results/<runId>/run.json):

{
  "id": "run-2026-04-25T10-30-00-000Z",
  "createdAt": "2026-04-25T10:30:00.000Z",
  "targets": ["node-20"],
  "testCount": 15,
  "label": "baseline v2"
}

How Resume Works

When --resume is passed:

Finds the latest incomplete run (where stage !== "report"), or uses the run given by --run <runId>
Loads the saved pipeline state
Skips completed stages entirely (e.g., if stage="judge", execute is skipped)
Within a stage: only runs tests not yet in the completed map for each target
Progress is saved after each individual test — safe against crashes
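
The skip logic above can be sketched as two small functions over the saved state. This illustrates the described behaviour under the state shape from the earlier example; stagesToRun and pendingTests are hypothetical names, not the tool's internals.

```typescript
type Stage = "execute" | "judge" | "report";
type WorkStage = "execute" | "judge";

// Subset of pipeline-state.json needed for resume decisions.
interface PipelineState {
  stage: Stage;
  testCases: number;
  completed: Partial<Record<WorkStage, Record<string, string[]>>>;
}

const STAGE_ORDER: Stage[] = ["execute", "judge", "report"];

// Stages before the saved stage are treated as complete and skipped entirely.
function stagesToRun(state: PipelineState): Stage[] {
  return STAGE_ORDER.slice(STAGE_ORDER.indexOf(state.stage));
}

// Within a stage, only tests missing from the completed map for this target remain.
function pendingTests(
  state: PipelineState,
  stage: WorkStage,
  target: string,
  allTestIds: string[],
): string[] {
  const done = new Set(state.completed[stage]?.[target] ?? []);
  return allTestIds.filter((id) => !done.has(id));
}
```

With a saved stage of "judge", stagesToRun returns only judge and report, so execute is not repeated.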

Abort Handling

First Ctrl+C: Graceful — finishes current test, saves state, prints "use --resume to continue"
Second Ctrl+C: Hard exit — immediate process termination
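
The two-step abort behaviour can be sketched as a small state machine. This is an assumed design, not the tool's actual handler; createAbortHandler is a hypothetical name, and the hard-exit action is injected so the sketch can run without terminating the process.

```typescript
// Builds a Ctrl+C handler: the first signal requests a graceful stop,
// the second triggers the injected hard-exit action.
function createAbortHandler(
  hardExit: () => void,
  log: (msg: string) => void = (msg) => console.log(msg),
) {
  let stopRequested = false;
  return {
    // The pipeline loop would poll this between tests, then save state and stop.
    shouldStop: (): boolean => stopRequested,
    onSigint: (): void => {
      if (stopRequested) {
        hardExit(); // second Ctrl+C: terminate immediately
        return;
      }
      stopRequested = true; // first Ctrl+C: let the current test finish
      log("use --resume to continue");
    },
  };
}

// Possible wiring in Node.js (left as a comment so the sketch has no side effects):
// const abort = createAbortHandler(() => process.exit(1));
// process.on("SIGINT", abort.onSigint);
```

Keeping the exit action injectable also makes the handler easy to unit test.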

Running the Pipeline

Run agentic-usability eval -p $ARGUMENTS and monitor the output. If interrupted, suggest --resume to continue.

For detailed pipeline internals, see pipeline-guide.md.
