Audit a Prime Intellect / verifiers RL environment before training on it. Six judgment-based checks (integrity, problem-statement alignment, reward design, latency, rollout quality, contamination) run by the agent, producing a scorecard with scores and written justifications, plus an opt-in env-repair skill that applies the feedback to a local copy.
Contamination check — compare the environment's dataset against the HuggingFace datasets the user explicitly provided. N/A (carries no weight in the rating) if the user provided none. Matching instances lower the score; a clean dataset scores high.
Audit a Prime Intellect / verifiers RL environment before training on it. Runs six judgment-based checks (integrity, problem-statement alignment, reward design, latency, rollout quality, contamination) and produces a per-check scorecard with scores and written justifications. Use when the user wants to audit, review, vet, or quality-check an RL environment, or asks "is my env good / ready to train on?".
Repair an audited RL environment from its scorecard feedback — only when the user explicitly asks to rewrite/repair/fix the env. Applies mechanical fixes to a local copy under rlenv_audit_repairs/ (never the installed package, never the Hub), triages design-level findings into recommendations, and validates the repaired copy by re-running the cheap checks. Publishing the result is the user's job.
Integrity check for an RL environment — verify it is written properly and actually runs. Confirms the dataset loads and is well-formed, the reward function is present and callable, the code follows verifiers/prime-intellect conventions, and there are no missing fields or broken imports. The "does it even run and is it shaped right" check.
Latency check — measure how long rollouts take end to end. Requires a model endpoint. Reads the shared cached rollout set (8 rollouts over ~20 samples) and reports timing; does not run its own rollouts.
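As a rough illustration of the kind of timing summary the latency check reports (mean and p90, as in the sample scorecard further down), here is a minimal sketch. The function name `summarize_latency`, the nearest-rank percentile method, and the sample timings are assumptions made for illustration; the tool's actual computation is not shown in this document.

```python
import statistics


def summarize_latency(durations_s):
    """Summarize end-to-end rollout durations (seconds): count, mean, p90.

    Hypothetical helper, not part of rlenv-audit's documented API.
    """
    if not durations_s:
        raise ValueError("no rollout timings to summarize")
    ordered = sorted(durations_s)
    # Nearest-rank p90: the smallest value with at least 90% of samples
    # at or below it. Integer arithmetic avoids float rounding in ceil().
    rank = (9 * len(ordered) + 9) // 10
    return {
        "n": len(ordered),
        "mean_s": statistics.fmean(ordered),
        "p90_s": ordered[rank - 1],
    }


# Made-up timings standing in for a cached rollout set.
timings = [1.2, 1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 3.0, 4.3, 5.1]
print(summarize_latency(timings))
```

Reporting a tail percentile next to the mean matters because a few slow rollouts can dominate wall-clock training time even when the average looks fine.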
rlenv_audit audits verifiers RL environments from the Prime Intellect Hub before you train on them. A broken reward function doesn't crash; it silently teaches the policy garbage. Point an agent (Claude Code / Codex) at an environment: it runs six checks and returns a scorecard out of 10 with written feedback on what to improve.
# Install the skills (pick one)
uvx --python 3.12 rlenv-audit install-skills
pip install rlenv-audit && rlenv-audit install-skills # needs Python >= 3.11
Why --python 3.12: a Hub env must install into the same interpreter as the
audit tool, and envs declare Python floors (most >=3.11, some higher) — a
3.12 venv clears nearly all of them in one go.
Then ask your agent, giving the full environment id (account/name; bare
names like gsm8k are ambiguous on the Hub), your problem statement, and
optionally a model endpoint and the HuggingFace datasets to check
contamination against:
Audit primeintellect/gsm8k. I'm trying to train a grade-school math solver.
Check contamination against openai/gsm8k.
(in Claude Code or Codex)
If a vLLM server is up on the default address (http://localhost:8000/v1), the
audit finds it by itself: endpoint and model name are auto-detected, and it
tells you what it found. Serving somewhere else? Name it in the prompt, for
example: "Use my vLLM endpoint at http://localhost:8000/v1, model Qwen2.5-7B."
An explicitly named endpoint always wins; with no endpoint given and nothing
on the default address, checks 4 and 5 (latency and rollout quality) are N/A.
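The resolution order described above (explicit endpoint first, then the default address, otherwise N/A) could be sketched as follows. This is an assumption-laden illustration, not the tool's code: `resolve_endpoint` and `fetch_json` are hypothetical names, and the probe relies only on the model listing that vLLM's OpenAI-compatible server exposes at `/v1/models`.

```python
import http.client
import json
import urllib.request

DEFAULT_BASE_URL = "http://localhost:8000/v1"


def fetch_json(url, timeout=2.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def resolve_endpoint(explicit=None, fetch=fetch_json):
    """Pick the endpoint to use, or return None if none is available.

    explicit: optional (base_url, model) pair named by the user.
    fetch: injectable so the logic can be exercised without a live server.
    """
    # An explicitly named endpoint always wins; no probing needed.
    if explicit is not None:
        base_url, model = explicit
        return {"base_url": base_url, "model": model, "source": "explicit"}
    try:
        # OpenAI-compatible servers list served models as {"data": [{"id": ...}]}.
        listing = fetch(DEFAULT_BASE_URL + "/models")
        model = listing["data"][0]["id"]
    except (OSError, ValueError, KeyError, IndexError, TypeError,
            http.client.HTTPException):
        # Nothing usable on the default address: endpoint-dependent checks are N/A.
        return None
    return {"base_url": DEFAULT_BASE_URL, "model": model, "source": "auto-detected"}


# Stub standing in for a running server, so the example needs no network.
def fake_fetch(url):
    return {"data": [{"id": "Qwen2.5-7B"}]}


print(resolve_endpoint(fetch=fake_fetch))
print(resolve_endpoint(explicit=("http://gpu-box:9000/v1", "my-model")))
```

Calling `resolve_endpoint()` with no arguments would probe the real default address and return None when no server answers.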
The scorecard has one row per check, each scored out of 10, followed by one final score and written feedback:
rlenv_audit · primeintellect/gsm8k
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ check ┃ status ┃ score ┃ justification ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ integrity │ PASS │ 9.5 │ loads, reward callable, well-formed │
│ problem_alignment │ PASS │ 9.0 │ dataset/reward match the stated goal │
│ reward_design │ PASS │ 8.8 │ discriminates; matches judgment 18/20 │
│ latency │ PASS │ 8.5 │ mean 2.1s / p90 4.3s, no errors │
│ rollout_quality │ PASS │ 8.0 │ prompt clear; 6% truncated rollouts │
│ contamination │ WARN │ 6.0 │ 3 near-matches with openai/gsm8k test │
└───────────────────┴────────┴───────┴─────────────────────────────────────────┘
overall: WARN rating: 8.5/10
feedback
The environment is solidly built: it loads cleanly, the reward is a real
verifier (boxed-answer extraction + math equivalence, not a stub), and it
discriminates well: correct completions scored 1.0 and every wrong or
malformed probe scored 0.0, matching my own judgment on 18 of 20 cases.
The main thing to improve is contamination: 3 of the sampled training
instances near-match the openai/gsm8k test split you asked me to check, so
benchmark gains may partly be memorization; either dedupe against that test
split or report on a different set. Second, the parser only accepts \boxed{}
answers; consider accepting plain final-line answers too, or the policy gets
zero reward for correct-but-unformatted output early in training.
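To make the "near-match" and "dedupe" ideas in the feedback concrete, here is a minimal sketch that flags environment prompts overlapping a reference set, using word-trigram Jaccard similarity. The technique, the 0.8 threshold, and the names `shingles`, `jaccard`, and `near_matches` are illustrative assumptions; the audit itself is judgment-based and its matching method is not specified here.

```python
import re


def shingles(text, n=3):
    """Lowercased word n-grams, ignoring punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Set overlap in [0, 1]; empty sets never match."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_matches(env_prompts, reference_prompts, threshold=0.8):
    """Return (env_prompt, reference_prompt, score) for each flagged env prompt."""
    reference = [(r, shingles(r)) for r in reference_prompts]
    hits = []
    for prompt in env_prompts:
        prompt_shingles = shingles(prompt)
        for ref_text, ref_shingles in reference:
            score = jaccard(prompt_shingles, ref_shingles)
            if score >= threshold:
                hits.append((prompt, ref_text, score))
                break  # one match is enough to flag this prompt
    return hits


# Made-up examples, not taken from any real dataset.
env = [
    "Tom has 3 apples and buys 4 more. How many apples does Tom have now?",
    "A train travels 60 miles in 2 hours. What is its average speed?",
]
ref = ["tom has 3 apples, and buys 4 more; how many apples does Tom have now"]

hits = near_matches(env, ref)
flagged = {h[0] for h in hits}
deduped = [p for p in env if p not in flagged]
print(len(hits), len(deduped))
```

The pairwise loop is quadratic, which is acceptable for a small audit sample; a full dataset would likely call for hashing or indexing instead.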
A FAIL on any check fails the audit.
The report is written to rlenv_audit_reports/<account>__<name>/report.md (human-readable) and
report.json (machine-readable) in your working directory, so you can commit
it, share it, or diff it against a re-audit after fixes.
npx claudepluginhub vivekvkashyap/rlenv_audit --plugin env-audit
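One way to picture how per-check statuses could roll up into the overall line is a worst-status-wins rule that skips N/A checks. Only "a FAIL on any check fails the audit" and "N/A carries no weight" come from the text above; treating a single WARN as an overall WARN is inferred from the sample scorecard, and `overall_status` is a hypothetical name. How the numeric rating is weighted is not stated, so this sketch does not attempt it.

```python
# Higher number = worse outcome.
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def overall_status(check_statuses):
    """Roll up {check_name: status} into one status, ignoring N/A checks."""
    scored = [s for s in check_statuses.values() if s != "N/A"]
    if not scored:
        return "N/A"
    # Worst status wins: any FAIL fails the audit, else any WARN warns.
    return max(scored, key=SEVERITY.__getitem__)


sample = {
    "integrity": "PASS",
    "problem_alignment": "PASS",
    "reward_design": "PASS",
    "latency": "PASS",
    "rollout_quality": "PASS",
    "contamination": "WARN",
}
print(overall_status(sample))  # WARN
```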