Skill

rl-task-audit

Audit an RL training codebase to infer task definition, training/eval commands, rewards, terminations, curriculum, domain randomization, logs, and metrics before planning experiments. Use for robotics/RL codebases, reward tuning, curriculum design, domain-randomization design, or ambiguous task requests.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/rl-experiment-assistant:rl-task-audit

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are auditing an RL training repository. Your output is a grounded task card, not an experiment proposal yet.

SKILL.md

101 lines · ~1.4k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last commit: May 14, 2026

Mandatory behavior

Inspect the target RL codebase before inferring the task. Prefer source files, configs, README/training guides, launcher scripts, reward code, termination code, curriculum/randomization configs, and evaluation scripts.
Identify the canonical adapter fields from templates/adapter-template.yaml, especially environment.manager, environment.command_prefix, commands.train_entrypoint, commands.train_command_raw, commands.train_command_runnable, commands.fragments, paths, metrics, hardware, and contract_gate.required_before_training.
Separate facts from assumptions. Mark uncertain items explicitly.
Do not launch GPU-consuming training, mamba environments, simulators, W&B or other network calls, or the target repository's training modules from this skill.
If the user supplied a high-level objective, reconcile it against the code-derived task definition and ask for confirmation before continuing to experiment planning.
If the user supplied setup details, record them as high-confidence user-provided adapter facts while still verifying code support.
Treat .rlxp/contract.yaml as the launch authority. Audit and setup may only create status: draft_blocked or prepare status: ready_for_user_confirmation; they must never approve a launch.
Run setup/scan helpers yourself when local shell access and edit permission are available. Do not tell the user to run bundled Python scripts unless the agent cannot execute local commands.
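
The contract rule above can be sketched as a small guard. This is a minimal illustration, assuming the contract is loaded as a mapping with a top-level status key; the helper names audit_may_set_status and set_contract_status are hypothetical and are not part of the plugin.

```python
# Statuses the audit/setup step is allowed to write, per the rule above.
AUDIT_WRITABLE_STATUSES = frozenset({"draft_blocked", "ready_for_user_confirmation"})


def audit_may_set_status(status: str) -> bool:
    """Return True only for statuses the audit step may write (hypothetical helper)."""
    return status in AUDIT_WRITABLE_STATUSES


def set_contract_status(contract: dict, status: str) -> dict:
    """Return a copy of the contract with the new status, or raise if not permitted.

    Assumption: the contract is a plain mapping with a top-level "status" key.
    """
    if not audit_may_set_status(status):
        raise PermissionError(f"audit/setup may not set status {status!r}")
    updated = dict(contract)
    updated["status"] = status
    return updated
```

Any status outside the two named in the rule is rejected, so a launch approval can only come from a separate, user-confirmed step.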

Plugin-root resolution

Resolve <plugin-root> to the installed rl-experiment-assistant plugin directory:

Codex: infer from the loaded skill path, installed plugin path, or user-provided <plugin-root>.
Claude Code: use ${CLAUDE_PLUGIN_ROOT} when present; otherwise infer from the loaded plugin/skill path.
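
The resolution order can be sketched as follows. This is only an illustration: the function name resolve_plugin_root is hypothetical, and the fallback assumes a <plugin-root>/skills/<skill-name>/SKILL.md layout, which the text above does not confirm.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_plugin_root(
    skill_file: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Prefer CLAUDE_PLUGIN_ROOT, then fall back to the loaded skill path."""
    env = os.environ if env is None else env
    root = env.get("CLAUDE_PLUGIN_ROOT")
    if root:
        return Path(root)
    if skill_file:
        # Assumed layout: <plugin-root>/skills/<skill-name>/SKILL.md,
        # so the plugin root is three levels above the skill file.
        return Path(skill_file).absolute().parents[2]
    # Nothing to infer from; the caller should ask the user for <plugin-root>.
    return None
```

Returning None rather than guessing leaves room for the third option named above, a user-provided <plugin-root>.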

What to inspect

Look for:

Training entry points: train.py, train_agent.py, runner.py, scripts/train*, launch*.
Eval entry points: eval.py, eval_agent.py, play.py, replay.py, rollout.py.
Config systems: Hydra, Tyro, argparse, YAML, Gin, dataclasses, config groups.
Reward terms: reward, rewards, terms, tracking, contact, penalty, regularization, alive, termination.
Curriculum/adaptive sampling: curriculum, difficulty, level, adaptive, sampler, hard, success.
Domain randomization: randomization, domain_rand, friction, mass, com, noise, push, latency.
Metrics/logging: WandB, TensorBoard, CSV, JSON, stdout tables, videos, checkpoints.
Hardware launchers: torchrun, slurm, sbatch, tmux, ray, CUDA_VISIBLE_DEVICES, Kubernetes.
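
A read-only pass over the repository for some of these names might look like the sketch below. It is not the bundled rlxp_scan.py, whose behavior is not shown here; the function scan_repo, the subset of patterns, and the restriction to .py/.yaml/.yml files are assumptions made for illustration.

```python
import fnmatch
from pathlib import Path

# Filename patterns taken from the entry-point lists above.
ENTRYPOINT_PATTERNS = {
    "train": ["train.py", "train_agent.py", "runner.py", "train*", "launch*"],
    "eval": ["eval.py", "eval_agent.py", "play.py", "replay.py", "rollout.py"],
}
# A subset of the keywords listed above, matched case-insensitively in file text.
KEYWORDS = {
    "reward": ["reward", "penalty", "regularization", "termination"],
    "curriculum": ["curriculum", "difficulty", "adaptive", "sampler"],
    "domain_randomization": ["randomization", "domain_rand", "friction", "latency"],
}


def scan_repo(root: str) -> dict:
    """Collect candidate files per surface without importing or running anything."""
    hits = {key: [] for key in (*ENTRYPOINT_PATTERNS, *KEYWORDS)}
    for path in sorted(Path(root).rglob("*")):
        # Assumption: only Python and YAML files are worth reading here.
        if not path.is_file() or path.suffix not in {".py", ".yaml", ".yml"}:
            continue
        rel = path.relative_to(root).as_posix()
        for kind, patterns in ENTRYPOINT_PATTERNS.items():
            if any(fnmatch.fnmatch(path.name, p) for p in patterns):
                hits[kind].append(rel)
        text = path.read_text(errors="ignore").lower()
        for kind, words in KEYWORDS.items():
            if any(w in text for w in words):
                hits[kind].append(rel)
    return hits
```

Keyword hits are only leads. Under the evidence rules of this skill, a match on naming alone is low confidence until the surrounding code has been read.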

User-provided setup facts

When the user provides a concrete environment or command, record it verbatim in the audit as a statement of intent, not as permission to train:

Mamba/conda environment, especially hssim.

Raw training command, especially:

python -m holosoma.train_agent exp:g1-29dof-scene-traversal-hurdle logger:wandb-packman-scene-traversal

Runnable command prefix, especially mamba run -n hssim.
Config fragments such as exp:g1-29dof-scene-traversal-hurdle and logger:wandb-packman-scene-traversal.

For Holosoma scene traversal, prefer the module entry point python -m holosoma.train_agent over a source-file path. Treat the full mamba-wrapped command as a launch candidate only.
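
One way to break the user's command into adapter facts is sketched below. The dictionary keys mirror the adapter field names listed under Mandatory behavior, but the exact schema of templates/adapter-template.yaml is not shown here, so the flat layout, the function name split_train_command, and the rule that fragments are positional group:name tokens are assumptions. The command is only parsed, never executed.

```python
import shlex

RAW = (
    "python -m holosoma.train_agent "
    "exp:g1-29dof-scene-traversal-hurdle logger:wandb-packman-scene-traversal"
)
PREFIX = "mamba run -n hssim"


def split_train_command(raw: str, prefix: str = "") -> dict:
    """Parse a raw training command into adapter-style fields (no execution)."""
    tokens = shlex.split(raw)
    entrypoint = None
    rest = tokens[1:]
    # Module entry point form: python -m <module> ...
    if len(tokens) >= 3 and tokens[1] == "-m":
        entrypoint = tokens[2]
        rest = tokens[3:]
    # Assumption: config fragments are positional tokens shaped like group:name.
    fragments = [t for t in rest if ":" in t and not t.startswith("-")]
    return {
        "train_command_raw": raw,  # preserved verbatim
        "train_entrypoint": entrypoint,
        "fragments": fragments,
        "command_prefix": prefix,
        # Launch candidate only; building the string does not approve running it.
        "train_command_runnable": f"{prefix} {raw}".strip(),
    }
```

Calling split_train_command(RAW, PREFIX) yields holosoma.train_agent as the entry point and the two exp: and logger: tokens as fragments, with the raw string left untouched.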

If .rlxp/ is missing and files may be edited, initialize it in the target repository before writing the report. Use the bundled helpers as internal agent tools:

python <plugin-root>/scripts/rlxp_init.py --root . --project-name <name>
python <plugin-root>/scripts/rlxp_scan.py --root . --out .rlxp/scan.json

If the current working directory is not the target RL repository, pass the absolute target path to --root. Never write .rlxp/ into the plugin package, Codex plugin cache, Claude plugin cache, or agent home directory unless that is explicitly the target repository.
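
The placement rule can be checked before anything is written. The sketch below is illustrative only: safe_rlxp_dir and its parameters are hypothetical names, and the caller is assumed to supply the plugin package, plugin cache, and agent home paths as forbidden_roots.

```python
from pathlib import Path
from typing import Iterable


def is_inside(path: Path, parent: Path) -> bool:
    """True if path equals parent or lies anywhere beneath it."""
    path, parent = path.resolve(), parent.resolve()
    return path == parent or parent in path.parents


def safe_rlxp_dir(
    target_root: str,
    forbidden_roots: Iterable[str],
    explicit_target: bool = False,
) -> Path:
    """Return <target_root>/.rlxp, refusing plugin/cache/home locations.

    explicit_target=True models the exception in the rule above: the user
    has explicitly named that location as the target repository.
    """
    root = Path(target_root).resolve()
    if not explicit_target:
        for bad in forbidden_roots:
            if is_inside(root, Path(bad)):
                raise ValueError(f"refusing to write .rlxp/ under {bad}")
    return root / ".rlxp"
```

Resolving both paths first keeps the check from being bypassed by a relative --root or a symlinked cache directory.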

Output: Task Consensus Card

Produce this exact structure in the conversation and write it into .rlxp/report.md if project files may be edited:

# Task Consensus Card

## Codebase-derived task definition

## User objective interpreted as

## Training command candidates

## Evaluation command candidates

## Reward / termination / curriculum / DR surfaces found

## Candidate primary metric

## Candidate guardrail metrics

## Candidate tuning scope

## Required user confirmations before launch

1. Is the task definition correct?
2. Which primary metric should define improvement?
3. Which guardrails should block acceptance?
4. Which reward/curriculum/DR surfaces are allowed to change?
5. What is the GPU/wall-clock/iteration budget?
6. Which machines/GPU IDs may be used?
7. Which baseline command and evaluation protocol are approved?

Evidence quality

Assign each finding a confidence level: high if directly from code/config or user-provided setup, medium if inferred from nearby code, low if guessed from naming. User-provided command facts are high-confidence for intent; code support remains unverified until inspected in the target repository.
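
The confidence rule can be made mechanical with a small record type. This is a sketch under assumptions: the Finding class and the source labels (code, config, user_setup, nearby_code, naming) are hypothetical names chosen to match the categories described above.

```python
from dataclasses import dataclass

# Mapping from evidence source to confidence, following the rule above.
CONFIDENCE_BY_SOURCE = {
    "code": "high",
    "config": "high",
    "user_setup": "high",
    "nearby_code": "medium",
    "naming": "low",
}


@dataclass
class Finding:
    claim: str
    source: str  # one of the CONFIDENCE_BY_SOURCE keys
    verified_in_repo: bool = False

    @property
    def confidence(self) -> str:
        return CONFIDENCE_BY_SOURCE[self.source]

    def line(self) -> str:
        """Render one card line, flagging user facts not yet checked in code."""
        note = ""
        if self.source == "user_setup" and not self.verified_in_repo:
            note = " (intent only; code support unverified)"
        return f"[{self.confidence}] {self.claim}{note}"
```

Keeping verified_in_repo separate from the confidence level preserves the distinction drawn above: a user-provided command is high confidence as intent even before the code has been inspected.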
