From env-audit
Contamination check — compare the environment's dataset against the HuggingFace datasets the user explicitly provided. N/A (carries no weight in the rating) if the user provided none. Matching instances lower the score; a clean dataset scores high.
Slash command: /env-audit:contamination
Question: does this env's dataset overlap the datasets the user cares about, so that "improvement" measured on them would be partly memorization?
Runs only against user-provided datasets. The user names the HuggingFace
datasets to check (ids like openai/gsm8k, or hf.co links — normalize a link to
its org/name id). Do not pick benchmarks yourself. If the user provided
none, output immediately:
{"name": "contamination", "status": "N/A", "score": null,
"justification": "no contamination datasets provided"}
(N/A is excluded from the rating, so the check carries no weight when skipped.)
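The link normalization and the early N/A exit described above can be sketched as follows. This is a minimal illustration, not the skill's own code: the helper names `normalize_dataset_id` and `na_result` are hypothetical, and the set of accepted hosts is an assumption.

```python
import json
from urllib.parse import urlparse

# Assumption: these are the hosts a user might paste a dataset link from.
HF_HOSTS = {"hf.co", "huggingface.co", "www.huggingface.co"}


def normalize_dataset_id(ref: str) -> str:
    """Return an org/name id from a bare id or an hf.co / huggingface.co link."""
    ref = ref.strip()
    if "://" not in ref and ref.split("/")[0] not in HF_HOSTS:
        return ref.strip("/")  # already a bare id such as openai/gsm8k
    parsed = urlparse(ref if "://" in ref else "https://" + ref)
    parts = [p for p in parsed.path.split("/") if p]
    if parts and parts[0] == "datasets":  # .../datasets/org/name
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"cannot find an org/name id in {ref!r}")
    return "/".join(parts[:2])


def na_result(reason: str = "no contamination datasets provided") -> str:
    """Build the N/A record emitted when there is nothing to check."""
    return json.dumps({"name": "contamination", "status": "N/A",
                       "score": None, "justification": reason})
```

With this sketch, `normalize_dataset_id("https://hf.co/datasets/openai/gsm8k")` yields `openai/gsm8k`, and an empty list of user datasets leads straight to `na_result()`.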
Load each provided dataset with the datasets library (it ships with
verifiers), e.g.
python -c "from datasets import load_dataset; ds = load_dataset('openai/gsm8k', 'main', split='test')".
Use the split the user named; default to test, falling back to train.
If a dataset fails to load because it is gated / needs credentials
(error mentions gated, 401/403, authentication, token), don't skip it
silently — ask the user for an HF token (and access approval) first,
set HF_TOKEN, and retry; only proceed without it if they decline. If a
dataset can't be loaded for other reasons (bad id, offline), say so in the
justification and judge from the ones that did load; if none load, output
N/A with that reason.
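One way to sketch the split fallback and the gated-dataset handling is shown below. The function names `looks_gated` and `load_for_audit` are hypothetical, and the `loader` parameter is an assumption added so the logic can be exercised without network access; by default it uses `datasets.load_dataset`.

```python
# Keywords the instructions above associate with gated / credentialed datasets.
GATED_HINTS = ("gated", "401", "403", "authentication", "token")


def looks_gated(error: Exception) -> bool:
    """True when the error message suggests missing credentials or access."""
    message = str(error).lower()
    return any(hint in message for hint in GATED_HINTS)


def load_for_audit(dataset_id, config=None, split=None, loader=None):
    """Load one split: the user's split if named, otherwise test, then train.

    Returns (dataset, split_used). Raises PermissionError for gated datasets
    so the caller can ask for an HF token, RuntimeError for other failures.
    """
    if loader is None:
        from datasets import load_dataset as loader  # real loader by default
    candidates = [split] if split else ["test", "train"]
    last_error = None
    for name in candidates:
        try:
            return loader(dataset_id, config, split=name), name
        except Exception as error:
            if looks_gated(error):
                raise PermissionError(
                    f"{dataset_id} needs an HF token or access approval"
                ) from error
            last_error = error  # e.g. unknown split, bad id, offline
    raise RuntimeError(f"{dataset_id} could not be loaded: {last_error}")
```

A caller would catch `PermissionError`, ask the user for a token, set `HF_TOKEN`, and retry; a `RuntimeError` message goes into the justification instead.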
Get a good slice of the env's dataset. Re-run
rlenv-audit inspect <env> -n 100 (or more) so the comparison isn't limited
to the orchestrator's 20-sample inspect.
Compare instances. Check each env instance against each provided dataset and count the same-instance matches and near-matches.
Score 0–10 where 10 = clean, lower = more overlap. Anchor the score to the measured rate over the sampled rows: no same-instance matches → 9–10; isolated matches (≲5% of the sample) → 6–8 (WARN); systematic overlap (tens of percent) → ≤5; the env is the user's eval set used for training → FAIL territory. State the sample size — 0 matches in 100 rows is evidence, not proof.
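A rough sketch of how a measured match count could be turned into the score bands above. The matching method is not specified in this skill, so the normalized 5-gram Jaccard comparison, the 0.8 threshold, and the choice to treat any rate above 5% as the "≤5" band are all assumptions made for illustration; the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse everything except letters and digits to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text: str, n: int = 5) -> set:
    """Word n-grams of the normalized text (the whole text if it is shorter)."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_match(env_text: str, ref_text: str, threshold: float = 0.8) -> bool:
    """Assumed criterion: Jaccard overlap of n-gram sets at or above threshold."""
    a, b = ngrams(env_text), ngrams(ref_text)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


def count_matches(env_rows, ref_rows) -> int:
    """Number of sampled env rows that match at least one reference row."""
    return sum(any(is_match(e, r) for r in ref_rows) for e in env_rows)


def score_band(matches: int, sample_size: int) -> tuple:
    """Map the measured rate to the (low, high) score range from the rubric."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if matches == 0:
        return (9, 10)          # no same-instance matches
    if matches / sample_size <= 0.05:
        return (6, 8)           # isolated matches (WARN)
    return (0, 5)               # assumption: anything above 5% is systematic
```

The pairwise comparison is quadratic, which is acceptable for a sample of around 100 env rows against a benchmark split of a few thousand rows; larger datasets would need an index instead.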
Any number of datasets may be provided — audit every one independently and
score on the worst offender: four clean datasets must not dilute one
contaminated one. The justification names each dataset with its match count
(e.g. openai/gsm8k: 3 near-matches; hendrycks/MATH: clean; ...).
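The worst-offender rule can be sketched as taking the minimum per-dataset score rather than an average. The function name `final_result` and the input shape are hypothetical, and the score-to-status mapping (9 and above PASS, 6 to 8 WARN, 5 and below FAIL) is an assumption extrapolated from the score bands in this skill.

```python
import json


def final_result(per_dataset: dict) -> str:
    """per_dataset maps dataset id -> {"score": int, "matches": int}.

    The overall score is the lowest per-dataset score, so clean datasets
    cannot dilute a contaminated one.
    """
    if not per_dataset:
        return json.dumps({"name": "contamination", "status": "N/A",
                           "score": None,
                           "justification": "no contamination datasets provided"})
    worst = min(entry["score"] for entry in per_dataset.values())
    # Assumed thresholds; only the 6-8 -> WARN link is stated explicitly.
    status = "PASS" if worst >= 9 else "WARN" if worst >= 6 else "FAIL"
    parts = [
        f"{name}: clean" if entry["matches"] == 0
        else f"{name}: {entry['matches']} matches"
        for name, entry in per_dataset.items()
    ]
    return json.dumps({"name": "contamination", "status": status,
                       "score": worst, "justification": "; ".join(parts)})
```

For example, one dataset scoring 7 with three matches and another scoring 10 with none gives an overall score of 7, not the average of 8.5.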
{"name": "contamination", "status": "PASS|WARN|FAIL", "score": <0-10>,
"justification": "<one line: datasets checked, matches found (counts) or clean>"}
npx claudepluginhub vivekvkashyap/rlenv_audit --plugin env-audit