From env-audit
Audit a Prime Intellect / verifiers RL environment before training on it. Runs six judgment-based checks (integrity, problem-statement alignment, reward design, latency, rollout quality, contamination) and produces a per-check scorecard with scores and written justifications. Use when the user wants to audit, review, vet, or quality-check an RL environment, or asks "is my env good / ready to train on?".
Slash command: /env-audit:env-audit
You are auditing one RL environment built on the verifiers framework (the Prime
Intellect Environments Hub standard). The audit is six checks, each its own
skill under skills/. Each check is judgment-heavy — you perform it yourself
using your own reasoning plus the deterministic rlenv-audit tools — and returns
a score (0–10), a status, and a one-line justification.
Ask the user (or take from their request) and confirm:
Environment id: the full account/name, e.g. primeintellect/gsm8k. Bare names
like gsm8k are ambiguous (many accounts publish same-named envs on the Hub); if
the user gives one, ask for the full id before starting.
Model endpoint: probe the default vLLM address first with
curl -s -m 3 http://localhost:8000/v1/models. If it answers, tell the user
("found a model server at http://localhost:8000/v1, using it for the rollout
checks") and proceed with it (the model name is auto-detected from the server).
Only if nothing answers there, ask once: "No model server found at the default
vLLM address. Do you have an endpoint for the rollout checks, or should I skip
them?" (rlenv-audit rollouts does the same probe itself when called with no
--endpoint.)
Datasets for the contamination check (optional): dataset ids (e.g.
openai/gsm8k) to check the env's dataset against. Enables check 6; if none are
given, contamination is N/A and carries no weight. Never substitute default
benchmarks of your own.

Do whatever setup is missing; the user shouldn't have to prepare anything:
Tools. If rlenv-audit is not on PATH, install it:
pip install rlenv-audit (use the active venv if there is one; fall back to
pip install git+https://github.com/vivekvkashyap/RLEnv_audit.git if the
PyPI package is unavailable). This also brings in verifiers and its
vf-install command.
Find rlenv-audit's Python environment — this is the one non-obvious step.
rlenv-audit is often installed as a uvx / uv tool in its own isolated
venv (e.g. ~/.local/share/uv/tools/rlenv-audit/), which is not the
active shell Python. verifiers and vf-install live in that same venv, and
the env package must be installed into it (verifiers loads an env by
importing it, so they must share one interpreter). Resolve it once and reuse
it for every install/inspect:
head -1 "$(command -v rlenv-audit)" shows the interpreter on the launcher's
shebang; its directory is the venv's bin/. Use that bin/vf-install and
bin/python (e.g. VENV=~/.local/share/uv/tools/rlenv-audit; $VENV/bin/vf-install <account>/<env>).
Check that the env imports with $VENV/bin/python -c "import <module>" (module =
the env name with - replaced by _, last path segment), not the shell python3.
Note: most Hub envs need Python ≥ 3.11; rlenv-audit itself requires ≥ 3.11
and is best installed on 3.12 so env floors are cleared in advance.
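The shebang lookup and the module-name rule above can be sketched in Python. This is a minimal illustration, not part of rlenv-audit: the helper names (interpreter_from_shebang, venv_tools, module_name) are hypothetical, and it assumes the launcher's first line is a plain #!/path/to/python shebang (a #!/usr/bin/env or shell-wrapper shebang would need separate handling).

```python
from pathlib import PurePosixPath


def interpreter_from_shebang(first_line: str) -> PurePosixPath:
    """Return the interpreter path named on a launcher's shebang line."""
    if not first_line.startswith("#!"):
        raise ValueError("launcher has no shebang line")
    parts = first_line[2:].split()
    if not parts:
        raise ValueError("empty shebang")
    # The first token is the interpreter path; anything after it is an argument.
    return PurePosixPath(parts[0])


def venv_tools(interpreter: PurePosixPath) -> dict:
    """Paths of the tools that sit next to the interpreter in the venv's bin/."""
    bin_dir = interpreter.parent
    return {"python": interpreter, "vf-install": bin_dir / "vf-install"}


def module_name(env_id: str) -> str:
    """Last path segment of the env id, with '-' replaced by '_'."""
    return env_id.rstrip("/").split("/")[-1].replace("-", "_")


# Example with a made-up shebang line, as `head -1` might print it.
tools = venv_tools(interpreter_from_shebang(
    "#!/home/user/.local/share/uv/tools/rlenv-audit/bin/python"))
print(tools["vf-install"])
print(module_name("primeintellect/gsm8k"))
```

Resolving the paths once and passing them around keeps every later install and import check on the same interpreter, which is the point of this step.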
Python-floor fallback — if vf-install <env> (or pip install of the
env) fails with a Requires-Python conflict, don't stop: read the version
the env demands from the error, build a fresh venv that satisfies it, and
move the whole audit there —
uv venv /tmp/envaudit_venv --python <X.Y> && uv pip install -p /tmp/envaudit_venv/bin/python rlenv-audit
(uv downloads the CPython if missing, no root needed), then install the env
into it and use /tmp/envaudit_venv/bin/ for every subsequent command
in this audit. Tell the user you did this and why.

The environment. Try the inspect in section 2 below; if it fails with a
load/import error, install the env into the venv found above and retry:
$VENV/bin/vf-install <account>/<env> (e.g. vf-install primeintellect/gsm8k).
Classify an install failure; do not conflate two different things:
The env does not exist: vf-install reports the id is not found on the
Hub (404 / "no matching distribution" / "could not find environment"). This
is a wrong input: stop and tell the user "There is no environment
named <account>/<name> on the Prime Intellect Hub" (quote the error), ask
them to check the id, and produce no scorecard.
The env exists but cannot be installed: vf-install finds the package but the
install/build fails (dependency conflict, compile error, Python-version
mismatch), or it installs yet the module still isn't importable. Do not
claim the env doesn't exist. Report it as a setup/integrity failure: quote
the build error, say the environment exists on the Hub but could not be
installed in this environment, and stop. (If it imports but crashes on
load, that is the integrity finding in step 2, not here.)

Run rlenv-audit inspect <env> -n 20 --out /tmp/envaudit_inspect.json and read
it.
Credentials gate — check this before judging anything. If the inspect JSON
carries error_kind: "auth" (or dataset_error_kind: "auth"), the failure is
missing credentials / gated access, not a broken env — a gated dataset like
GPQA is intentional. Do NOT score it, do NOT write a FAIL report. Stop and
ask the user, naming exactly what's needed, e.g.: "The env's dataset
<org/name> is gated on HuggingFace — request access at
https://huggingface.co/datasets/<org/name> and give me a token (HF_TOKEN),
or tell me to abort." When the user provides it, set it for every subsequent
command in this audit (export HF_TOKEN=... / pass it into the env of each
shell call), re-run the inspect, and resume the audit from there. Only if
the user explicitly declines: mark every check N/A with justification
"gated dataset — no credentials provided" and say the env could not be audited
on this box — missing credentials are never an env defect. The same rule
applies to any later step that fails credential-shaped (an endpoint returning
401, a judge rubric needing an API key): ask, set, re-run that step.
If the env installed but loaded is false for any other reason (it
exists yet crashes on load), that is a genuine audit finding: the
integrity check fails immediately — report that as the scorecard and stop
(the other checks can't run).
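The order of these two decisions matters: the credentials gate is checked before anything is scored. A minimal sketch, assuming error_kind, dataset_error_kind, and loaded are top-level keys of the inspect JSON (the text names the fields but not their nesting); the function names and the returned labels are hypothetical.

```python
import json


def triage_inspect(inspect: dict) -> str:
    """Decide the next audit step from the inspect result."""
    # 1. Credentials gate first: an auth failure is never an env defect.
    if "auth" in (inspect.get("error_kind"), inspect.get("dataset_error_kind")):
        return "ask_for_credentials"
    # 2. Any other load failure is a genuine integrity finding.
    if not inspect.get("loaded", False):
        return "integrity_fail"
    # 3. Otherwise carry on with the remaining checks.
    return "continue"


def triage_file(path: str = "/tmp/envaudit_inspect.json") -> str:
    """Load the inspect output written by the command above and triage it."""
    with open(path, encoding="utf-8") as fh:
        return triage_inspect(json.load(fh))
```

Checking for "auth" before looking at loaded is what keeps a gated dataset from being reported as a broken environment.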
These need no model — your own judgement plus the tools. Run each by following
its skill (installed alongside this one; in the repo they live under skills/),
in order, and collect {name, status, score, justification}:
env-audit-integrity
env-audit-problem-alignment
env-audit-reward-design
env-audit-contamination (N/A if the user provided no datasets to check)

If the user gave an endpoint (or chose "dummy"):
rlenv-audit rollouts <env> --endpoint <url> --model <name> -n 20 -k 8 --out /tmp/envaudit_rollouts.json
(or --dummy for a no-endpoint dry run). This drives verifiers' own vf-eval
engine, so rollouts follow the environment's real generation path; eight
rollouts over ~20 samples, scored and timed (with per-rollout truncation and
token usage), cached to that file. vf-eval is a client — the user's served
model must be up; it does not start one.
Then run the env-audit-latency and env-audit-rollout-quality skills, both
reading that single cache; do not roll out again.

If there is no endpoint, mark latency and rollout_quality as N/A.
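The branch at this step can be sketched as follows. The flags are the ones shown in the command above; the function name, the return shape, and the wording of the N/A justification are illustrative assumptions, and the --dummy path is left out.

```python
ROLLOUT_CACHE = "/tmp/envaudit_rollouts.json"


def plan_rollout_step(env, endpoint=None, model=None):
    """Return (argv, na_checks): a command to run, or the checks to mark N/A."""
    if endpoint is None:
        # No endpoint: both rollout-based checks are N/A and carry no score.
        na_checks = [
            {"name": name, "status": "N/A", "score": None,
             "justification": "no model endpoint provided"}
            for name in ("latency", "rollout_quality")
        ]
        return None, na_checks
    if not model:
        raise ValueError("an endpoint was given without a model name")
    # One rollout run; both skills read the same cache file afterwards.
    argv = ["rlenv-audit", "rollouts", env,
            "--endpoint", endpoint, "--model", model,
            "-n", "20", "-k", "8", "--out", ROLLOUT_CACHE]
    return argv, []
```

A caller would hand argv to subprocess.run(argv, check=True) once, then point both rollout skills at ROLLOUT_CACHE.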
Each check scores 0–10 (one decimal allowed). Write all six results, plus a
written feedback section, to /tmp/envaudit_results.json:
{"env_id": "<env>", "checks": [
{"name": "integrity", "status": "PASS|WARN|FAIL", "score": 0-10, "justification": "..."},
{"name": "problem_alignment", "status": "PASS|WARN|FAIL", "score": 0-10, "justification": "..."},
{"name": "reward_design", "status": "...", "score": ..., "justification": "..."},
{"name": "latency", "status": "PASS|WARN|FAIL|N/A", "score": ...|null, "justification": "..."},
{"name": "rollout_quality", "status": "...", "score": ..., "justification": "..."},
{"name": "contamination", "status": "PASS|WARN|FAIL|N/A", "score": ...|null, "justification": "..."}
], "feedback": "<1-3 paragraphs>"}
feedback is 1–3 short paragraphs for the env author: first what the environment does right (be specific — cite what you observed), then what can be improved and how, in priority order. This is the part a human acts on; make every sentence earn its place.
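Before writing the file, it can help to validate the entries against the shape above. A minimal sketch; the helper names are hypothetical, and the rules it enforces (status from the four listed values, score between 0 and 10 rounded to one decimal, null score for N/A) are read off the template rather than taken from rlenv-audit's own validation, which may differ.

```python
import json

CHECK_NAMES = ("integrity", "problem_alignment", "reward_design",
               "latency", "rollout_quality", "contamination")
STATUSES = {"PASS", "WARN", "FAIL", "N/A"}


def make_check(name, status, score, justification):
    """Build one entry of the "checks" list, rejecting malformed values."""
    if name not in CHECK_NAMES:
        raise ValueError(f"unknown check: {name}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if status == "N/A":
        score = None  # an N/A check carries no score
    elif score is None or not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    else:
        score = round(float(score), 1)  # one decimal allowed
    return {"name": name, "status": status, "score": score,
            "justification": justification}


def build_results(env_id, checks, feedback):
    """Assemble the results document; all six checks must be present once."""
    names = sorted(c["name"] for c in checks)
    if names != sorted(CHECK_NAMES):
        raise ValueError("expected exactly one entry per check")
    return {"env_id": env_id, "checks": checks, "feedback": feedback}


def write_results(results, path="/tmp/envaudit_results.json"):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
```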
Then rlenv-audit scorecard /tmp/envaudit_results.json to render it. The final
rating is a weighted average out of 10 over the checks that ran (N/A
excluded): latency and contamination weigh 0.5, the other four checks 1.0 — the
tool computes this for you.
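For intuition only, the weighting can be reproduced by hand. This sketch assumes a plain weighted mean over the checks that ran; rlenv-audit scorecard remains the source of truth, and its rounding and grade thresholds are not modeled here. The function name and the sample scores are made up for illustration.

```python
HALF_WEIGHT = {"latency", "contamination"}  # these weigh 0.5, the rest 1.0


def weighted_rating(checks):
    """Weighted mean of the scores, skipping N/A checks; None if none ran."""
    total = 0.0
    weight_sum = 0.0
    for check in checks:
        if check["status"] == "N/A" or check["score"] is None:
            continue  # N/A checks are excluded from both sums
        weight = 0.5 if check["name"] in HALF_WEIGHT else 1.0
        total += weight * check["score"]
        weight_sum += weight
    return total / weight_sum if weight_sum else None


example = [
    {"name": "integrity", "status": "PASS", "score": 9},
    {"name": "problem_alignment", "status": "PASS", "score": 8},
    {"name": "reward_design", "status": "WARN", "score": 7},
    {"name": "latency", "status": "WARN", "score": 6},
    {"name": "rollout_quality", "status": "PASS", "score": 8},
    {"name": "contamination", "status": "N/A", "score": None},
]
# (9 + 8 + 7 + 8 + 0.5 * 6) / (4 * 1.0 + 0.5) = 35 / 4.5
print(round(weighted_rating(example), 2))  # 7.78
```

Because N/A checks drop out of the denominator as well as the numerator, skipping the rollout or contamination checks does not drag the rating down.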
Persist the audit so it outlives the session (skip only if the user says not to
save). Create rlenv_audit_reports/<account>__<name>/ in the working directory
and write two files:
report.json — the machine-readable result:
rlenv-audit scorecard /tmp/envaudit_results.json --json > rlenv_audit_reports/<account>__<name>/report.json
(the computed scorecard: checks, grade, rating, feedback).
report.md: the human-readable report, which you author. It includes a title
(# rlenv_audit — <account>/<name>) and the date, and the rating
(rating: N.N/10).

End by telling the user where the report was saved. If the grade is WARN or
FAIL, you may offer once: "Want me to apply the mechanical fixes to a local
copy? (env-audit-repair — it never touches the installed env or the Hub.)"
Run the repair skill only if the user explicitly says yes — never as part of
the audit itself.
npx claudepluginhub vivekvkashyap/rlenv_audit --plugin env-audit