From worldjen
Benchmark or evaluate an entire AI model. Create, list, inspect, and compare WorldJen Bench runs — large-scale evaluation across many prompts and dimensions, executed on a GPU worker queue. Invoke ONLY when the target is a whole model — a model id, a checkpoint, a Hugging Face repo, "the model", "my model", comparing two models, tracking drift across model versions, or gating CI on a model's benchmark. To score an individual clip, video, or image, use `worldjen-score` instead, NOT this skill. Backed by `worldjen bench create / list / get / cancel / delete / logs / csv / videos / download-videos`. NOT for runner host setup (use `worldjen-runner`) or ranking clips that share a prompt (use `worldjen-rank`).
Slash command: `/worldjen:worldjen-bench`
Best-effort update check. Fails silently on network errors or non-marketplace installs. If output starts with `UPGRADE_AVAILABLE` or `JUST_UPGRADED`, surface it once to the user and continue with the skill workflow.
{
  for _p in \
    "$HOME/.claude/plugins/marketplaces/worldjen/bin/check-update" \
    "$HOME/.codex/.tmp/plugins/plugins/worldjen/bin/check-update"; do
    if [ -x "$_p" ]; then "$_p" 2>/dev/null && break; fi
  done
} 2>/dev/null || true
"How does the model perform overall?" Bench runs a full benchmark at scale: many prompts, your chosen dimensions, optionally multiple models compared, executed on a GPU worker queue.
This skill is for Bench run lifecycle. For setting up the GPU worker host that executes runs, use the worldjen-runner skill. For single-clip scoring, use worldjen-score; for prompt-locked ranking, use worldjen-rank.
Set WORLDJEN_API_KEY (get one at https://www.worldjen.com/settings/api-keys), or pass --api-key.
If worldjen is not installed yet, run pip install worldjen first (or use the worldjen-install skill).
Don't invent IDs — list them:
worldjen dimensions list --json
worldjen models list
worldjen runner list
worldjen bench create \
--name "<RUN_NAME>" \
--dimensions "<DIM1,DIM2>" \
--runner-id "<RUNNER_ID>" \
--model-id "<MODEL_ID>"
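For scripts that launch runs programmatically, the invocation can be assembled as an argument list instead of a shell string, which avoids quoting problems in run names. The following is a minimal sketch that uses only the flags documented in this skill; the helper name `bench_create_argv` and the example values are hypothetical, and the IDs should still come from the `list` commands rather than being made up.

```python
import shlex
from typing import List, Optional, Sequence


def bench_create_argv(
    name: str,
    dimensions: Sequence[str],
    runner_id: str,
    model_id: str,
    reference_model_id: Optional[str] = None,
    reasoning: bool = False,
    run_instructions: Optional[str] = None,
) -> List[str]:
    """Build the argv for `worldjen bench create` (hypothetical helper)."""
    argv = [
        "worldjen", "bench", "create",
        "--name", name,
        "--dimensions", ",".join(dimensions),  # comma-separated, as in the CLI example
        "--runner-id", runner_id,
        "--model-id", model_id,
    ]
    if reference_model_id:
        argv += ["--reference-model-id", reference_model_id]
    if reasoning:
        argv.append("--reasoning")
    if run_instructions:
        # Inline JSON or a path to a .json file, passed through unchanged.
        argv += ["--run-instructions", run_instructions]
    return argv


argv = bench_create_argv(
    "nightly check", ["<DIM1>", "<DIM2>"], "<RUNNER_ID>", "<MODEL_ID>", reasoning=True
)
print(shlex.join(argv))  # shell-safe rendering for logs
# To execute: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` skips the shell entirely, so a run name containing spaces or quotes reaches the CLI intact.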
Optional flags:
- `--reference-model-id <REF_MODEL_ID>`: compare against another model in the same run
- `--reasoning`: store the per-dimension reasoning and generated summaries (off by default)
- `--run-instructions <JSON_OR_PATH>`: runner pipeline options (inline JSON or path to a .json file)

worldjen bench list --json
worldjen bench list --status running # filter; also --page / --limit
worldjen bench get <RUN_ID> --json
worldjen bench logs <RUN_ID>
worldjen bench videos <RUN_ID> --json
worldjen bench csv <RUN_ID> -o results.csv # scorecard
worldjen bench download-videos <RUN_ID> -o ./videos
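Because runs execute on a worker queue, a script usually has to wait before fetching the scorecard. The sketch below polls `worldjen bench get <RUN_ID> --json` until the run is no longer active. It rests on assumptions not confirmed by this skill: that the JSON has a top-level `status` field, and that `queued` and `running` are the active values (`running` appears as a `--status` filter above; `queued` is a guess). Check the real output of `bench get` before relying on it.

```python
import json
import subprocess
import time
from typing import Callable, Dict, Sequence


def fetch_run(run_id: str) -> Dict:
    """Fetch one run via the CLI and parse its JSON output."""
    out = subprocess.run(
        ["worldjen", "bench", "get", run_id, "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def wait_for_run(
    run_id: str,
    fetch: Callable[[str], Dict] = fetch_run,
    active: Sequence[str] = ("queued", "running"),  # assumed status values
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
) -> Dict:
    """Poll until the run's status leaves the active set, then return the run."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch(run_id)
        # A missing "status" field also ends the wait, so a wrong field-name
        # assumption shows up immediately instead of looping until the timeout.
        if run.get("status") not in active:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still active after {timeout_s}s")
        time.sleep(interval_s)


# Usage (requires the CLI and WORLDJEN_API_KEY):
# final = wait_for_run("<RUN_ID>")
```

The `fetch` parameter is injectable so the loop can be exercised in tests without the CLI or network access.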
Fetch both runs as JSON and diff the per-dimension scores. Useful for catching regressions across model checkpoints (v3 → v4 → v5):
worldjen bench get <RUN_ID_A> --json
worldjen bench get <RUN_ID_B> --json
Flag any dimension where the score dropped by at least your threshold (5 percentage points, written 5pp, is a common gate for CI).
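The comparison itself is a per-dimension subtraction against a threshold. The sketch below assumes the per-dimension scores have already been extracted from each run's JSON into a plain `dimension -> score` mapping on a 0-100 scale; the actual layout of the `bench get --json` output is not specified here, and the dimension names and scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    threshold_pp: float = 5.0,
) -> List[Tuple[str, float]]:
    """Return (dimension, drop) pairs where the candidate fell by >= threshold_pp."""
    regressions = []
    for dim, base_score in baseline.items():
        if dim not in candidate:
            continue  # dimension not evaluated in the candidate run
        drop = base_score - candidate[dim]
        if drop >= threshold_pp:
            regressions.append((dim, drop))
    # Largest drop first, so the worst regression leads the report.
    return sorted(regressions, key=lambda item: item[1], reverse=True)


# Hypothetical scores for two checkpoints.
run_a = {"motion": 82.0, "fidelity": 74.5, "consistency": 90.0}
run_b = {"motion": 75.0, "fidelity": 73.0, "consistency": 91.5}

for dim, drop in find_regressions(run_a, run_b):
    print(f"{dim}: dropped {drop:.1f}pp")  # prints "motion: dropped 7.0pp"
```

In a CI job, exiting non-zero when the returned list is non-empty is enough to fail the build.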
- `worldjen.bench.create(...)`: enqueue a run on a runner you provisioned; the call returns immediately and the worker handles generation, upload, and scoring.
- `worldjen.bench.run_with_pipeline(...)`: in-process variant for scripts that already have the model loaded in Python; the SDK drives generation, upload, and (optionally) waits for evaluator scores. There is no CLI equivalent, because Python callables don't survive shell arguments.

The CLI and SDK both wrap `/api/v1/bench`. Use the `X-API-Key` header and an `Idempotency-Key` on POSTs so retries don't double-create.
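For direct HTTP callers, the header guidance above comes down to generating the idempotency key once per logical create and reusing it on every retry. A minimal sketch follows; the helper name is hypothetical, and the UUID key format and the JSON `Content-Type` are assumptions, since this skill does not document the request schema.

```python
import uuid
from typing import Dict, Optional


def bench_post_headers(api_key: str, idempotency_key: Optional[str] = None) -> Dict[str, str]:
    """Headers for a POST to /api/v1/bench (hypothetical helper)."""
    return {
        "X-API-Key": api_key,
        # Assumption: any unique string works as a key; a UUID4 is a common choice.
        "Idempotency-Key": idempotency_key or str(uuid.uuid4()),
        # Assumption: the endpoint accepts a JSON body.
        "Content-Type": "application/json",
    }


# Generate the key once per logical create, then reuse it for every retry.
key = str(uuid.uuid4())
first_attempt = bench_post_headers("<WORLDJEN_API_KEY>", key)
retry_attempt = bench_post_headers("<WORLDJEN_API_KEY>", key)
print(first_attempt["Idempotency-Key"] == retry_attempt["Idempotency-Key"])  # True
```

Generating a fresh key inside the retry loop would defeat the purpose, because the server would see each attempt as a new request.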
- `worldjen bench cancel <RUN_ID>`: stops a queued or in-flight run; partial results may be lost
- `worldjen bench delete <RUN_ID>`: permanently removes the run record and artifacts

Confirm with the user before either.
WORLDJEN_API_KEY
worldjen-score, not a full Bench run

- `worldjen-install`: install the SDK and CLI
- `worldjen-runner`: set up the GPU worker host
- `worldjen-score`: raw per-dimension scores for a single clip
- `worldjen-rank`: comparative ranking of clips that share a prompt
- `worldjen-leaderboard`: public leaderboard (no auth)

For examples and troubleshooting, see references/examples.md.
npx claudepluginhub moonmath-ai/worldjen-skills --plugin worldjen