rlenv_audit

rlenv_audit audits verifiers RL environments from the Prime Intellect Hub before you train on them. A broken reward function doesn't crash; it silently teaches the policy garbage. Point an agent (Claude Code or Codex) at an environment and it runs six checks, then returns a scorecard out of 10 with written feedback on what to improve.

Quickstart

# Install the skills (pick one)
uvx --python 3.12 rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills   # needs Python >= 3.11

Why --python 3.12: a Hub env must install into the same interpreter as the audit tool, and envs declare Python floors (most >=3.11, some higher), so a 3.12 venv clears nearly all of them in one go.

Then ask your agent, giving the full environment id (account/name; bare names like gsm8k are ambiguous on the Hub), your problem statement, and optionally a model endpoint and the HuggingFace datasets to check contamination against:

prompt

Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver. Check contamination against openai/gsm8k.

(in Claude Code or Codex)

If a vLLM server is up on the default address (http://localhost:8000/v1), the audit finds it by itself: endpoint and model name are auto-detected, and it tells you what it found. Serving somewhere else? Name it in the prompt: Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B. An explicitly named endpoint always wins. With no endpoint given and nothing on the default address, checks 4 and 5 (latency and rollout_quality, going by the scorecard order) are N/A.
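The precedence described above (explicit endpoint first, then the default address, otherwise N/A) can be sketched roughly as follows. This is not rlenv_audit's actual code: `resolve_endpoint`, `fetch_json`, and the injectable `fetch` parameter are hypothetical names for illustration. The sketch assumes detection goes through the `/models` route of vLLM's OpenAI-compatible server, and that when several models are served the first one listed is taken.

```python
import json
import urllib.request
from typing import Callable, Optional

DEFAULT_BASE_URL = "http://localhost:8000/v1"  # default address named in the text


def fetch_json(url: str, timeout: float = 2.0) -> dict:
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def resolve_endpoint(
    explicit_url: Optional[str] = None,
    explicit_model: Optional[str] = None,
    fetch: Callable[[str], dict] = fetch_json,  # injectable so it can be tested offline
) -> Optional[tuple[str, str]]:
    """Return (base_url, model), or None when the model-backed checks would be N/A."""
    # An explicitly named endpoint and model always win over auto-detection.
    if explicit_url and explicit_model:
        return explicit_url, explicit_model

    base_url = explicit_url or DEFAULT_BASE_URL
    try:
        # Assumption: the server lists its models as {"data": [{"id": ...}, ...]}.
        models = fetch(f"{base_url}/models").get("data", [])
    except (OSError, ValueError):
        return None  # nothing reachable, or the reply was not JSON
    if not models:
        return None
    return base_url, explicit_model or models[0]["id"]
```

Calling `resolve_endpoint()` with no arguments probes only the default address; a `None` result corresponds to the case where checks 4 and 5 are reported as N/A.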

Output

The scorecard, one row per check, each scored out of 10, plus one final score and written feedback:

                       rlenv_audit · primeintellect/gsm8k
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check             ┃ status ┃ score ┃ justification                           ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ integrity         │ PASS   │   9.5 │ loads, reward callable, well-formed     │
│ problem_alignment │ PASS   │   9.0 │ dataset/reward match the stated goal    │
│ reward_design     │ PASS   │   8.8 │ discriminates; matches judgment 18/20   │
│ latency           │ PASS   │   8.5 │ mean 2.1s / p90 4.3s, no errors         │
│ rollout_quality   │ PASS   │   8.0 │ prompt clear; 6% truncated rollouts     │
│ contamination     │ WARN   │   6.0 │ 3 near-matches with openai/gsm8k test   │
└───────────────────┴────────┴───────┴─────────────────────────────────────────┘
overall: WARN   rating: 8.5/10

feedback
The environment is solidly built: it loads cleanly, the reward is a real
verifier (boxed-answer extraction + math equivalence, not a stub), and it
discriminates well: correct completions scored 1.0 and every wrong or
malformed probe scored 0.0, matching my own judgment on 18 of 20 cases.

The main thing to improve is contamination: 3 of the sampled training
instances near-match the openai/gsm8k test split you asked me to check, so
benchmark gains may partly be memorization; either dedupe against that test
split or report on a different set. Second, the parser only accepts \boxed{}
answers; consider accepting plain final-line answers too, or the policy gets
zero reward for correct-but-unformatted output early in training.

Final score: a weighted average out of 10 over the checks that ran (N/A carries no weight). Latency and contamination weigh 0.5 each, the other four checks 1.0.

Feedback: 1 to 3 paragraphs, what the env does right first, then what to improve, in priority order.

A FAIL on any check fails the audit.

The full report is also saved to rlenv_audit_reports/<account>__<name>/report.md (human-readable) and report.json (machine-readable) in your working directory, so you can commit it, share it, or diff it against a re-audit after fixes.
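To make the scoring rules concrete, here is a minimal sketch that reproduces the 8.5/10 rating from the sample scorecard. It is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, N/A is represented as `None`, and rounding to one decimal is assumed from the displayed rating. The weights and the report directory pattern come from the text above.

```python
from pathlib import Path
from typing import Optional

# Weights as stated: latency and contamination 0.5 each, the other four 1.0.
WEIGHTS = {
    "integrity": 1.0,
    "problem_alignment": 1.0,
    "reward_design": 1.0,
    "latency": 0.5,
    "rollout_quality": 1.0,
    "contamination": 0.5,
}


def final_score(scores: dict[str, Optional[float]]) -> float:
    """Weighted average over the checks that ran; None marks N/A (no weight)."""
    ran = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(WEIGHTS[name] for name in ran)
    if total_weight == 0:
        raise ValueError("no checks ran")
    return sum(WEIGHTS[name] * s for name, s in ran.items()) / total_weight


def audit_failed(statuses: dict[str, str]) -> bool:
    """A FAIL on any check fails the audit."""
    return any(status == "FAIL" for status in statuses.values())


def report_dir(env_id: str, root: str = "rlenv_audit_reports") -> Path:
    """Map 'account/name' to '<root>/<account>__<name>'."""
    account, name = env_id.split("/", 1)
    return Path(root) / f"{account}__{name}"


sample = {
    "integrity": 9.5,
    "problem_alignment": 9.0,
    "reward_design": 8.8,
    "latency": 8.5,
    "rollout_quality": 8.0,
    "contamination": 6.0,
}
rating = round(final_score(sample), 1)  # assumption: displayed to one decimal
```

For the sample scorecard the weighted sum is 35.3 + 0.5 × 8.5 + 0.5 × 6.0 = 42.55 over a total weight of 5.0, which gives 8.51 and is displayed as 8.5. If latency were N/A, its 0.5 weight would drop out of both the sum and the denominator.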

The six checks
