Skill

rl-experiment-loop

Execute the evidence-gated RL experiment loop: launch approved training/eval jobs, parse results, update the report, classify each run as accept/reject/inconclusive/debug, and propose the next experiment within the approved budget.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/rl-experiment-assistant:rl-experiment-loop

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Run the iterative experiment loop only after `.rlxp/contract.yaml` and `.rlxp/report.md` record task, metrics, tuning scope, budget, hardware, baseline command, and evaluation protocol.

SKILL.md

91 lines · ~1.2k tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: May 14, 2026


Loop state

Use these files as the source of truth:

.rlxp/adapter.yaml: canonical repo-specific command/config/log mapping.
.rlxp/contract.yaml: launch gate for confirmed task, metrics, tuning scope, budget, hardware, baseline, and eval protocol.
.rlxp/report.md: living human-readable report.
.rlxp/experiments.yaml: proposed and approved experiment queue.
.rlxp/ledger.jsonl: append-only machine-readable run ledger.
.rlxp/runs/<experiment_id>/: logs, command, config diff, metrics, checkpoints, videos, result summary.

If .rlxp/contract.yaml is missing, incomplete, contradicts .rlxp/adapter.yaml, lacks status: approved_for_launch, or lacks derived training_allowed: true, stop and ask for confirmation. Do not treat an approved experiment entry as sufficient by itself.
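The launch gate above can be sketched as a function that lists every reason to stop. This is a minimal sketch, not the skill's implementation: it assumes the contract has already been parsed into a mapping, and the field names in `REQUIRED_FIELDS` are hypothetical stand-ins for the items the contract is said to record. Only `status` and `training_allowed` come from the text. The cross-check against .rlxp/adapter.yaml is left out because its schema is not given here.

```python
from typing import Any, List, Mapping, Optional

# Hypothetical key names for the items the contract must record.
REQUIRED_FIELDS = (
    "task", "metrics", "tuning_scope", "budget",
    "hardware", "baseline", "eval_protocol",
)


def launch_blockers(contract: Optional[Mapping[str, Any]]) -> List[str]:
    """Return the reasons launch must stop; an empty list means the gate passes."""
    if contract is None:
        return ["contract file is missing"]
    blockers = []
    for field in REQUIRED_FIELDS:
        if not contract.get(field):
            blockers.append(f"contract field '{field}' is missing or empty")
    if contract.get("status") != "approved_for_launch":
        blockers.append("status is not approved_for_launch")
    # Require the boolean True, not a truthy string such as "true".
    if contract.get("training_allowed") is not True:
        blockers.append("training_allowed is not true")
    return blockers
```

Returning all blockers at once, rather than stopping at the first, lets the confirmation request name everything that still needs sign-off.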

Execution protocol

For each approved experiment:

Re-read .rlxp/contract.yaml immediately before launch.
Verify contract status is approved_for_launch, all required confirmations are explicit, approval record is complete, budget remains, hardware target is available, and the queued experiment is inside approved scope.
Re-read .rlxp/adapter.yaml; prefer commands.train_command_runnable or commands.runnable_train_template.
Create an isolated run directory and, if editing code, use a git branch/worktree or clean patch when the target repo is version-controlled.
Record command, git commit if applicable, diff, environment, seed(s), GPU IDs, and expected output paths before launch.
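One way to capture the pre-launch record is sketched below. The file names `prelaunch.json`, `stdout.log`, and `metrics.json` are assumptions made for illustration, as is the record layout. The config diff is not captured here and would need to be stored separately.

```python
import json
import platform
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def current_git_commit(repo: Path) -> Optional[str]:
    """Return HEAD's commit hash, or None when repo is not a usable git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo, capture_output=True, text=True, check=False,
        )
    except OSError:  # git is not installed, or the repo path does not exist
        return None
    return out.stdout.strip() if out.returncode == 0 else None


def write_prelaunch_record(run_dir: Path, repo: Path, command: List[str],
                           seeds: List[int], gpu_ids: List[int]) -> Path:
    """Create an isolated run directory and write the record before launch."""
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to reuse a run directory
    record = {
        "command": command,
        "git_commit": current_git_commit(repo),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seeds": seeds,
        "gpu_ids": gpu_ids,
        # Hypothetical output locations inside the run directory.
        "expected_outputs": {
            "stdout": str(run_dir / "stdout.log"),
            "metrics": str(run_dir / "metrics.json"),
        },
    }
    path = run_dir / "prelaunch.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Using `exist_ok=False` makes a second launch into the same directory fail loudly instead of overwriting earlier evidence.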

For Holosoma scene traversal, the canonical raw command is:

python -m holosoma.train_agent exp:g1-29dof-scene-traversal-hurdle logger:wandb-packman-scene-traversal

and the runnable command is:

mamba run -n hssim python -m holosoma.train_agent exp:g1-29dof-scene-traversal-hurdle logger:wandb-packman-scene-traversal

Re-read the contract again after any code/config patch. Block if scope, baseline, eval protocol, or budget no longer match.
Run a dry run or smoke test only if the contract approves it and it fits the remaining budget.
Launch training only within approved budget.
Capture stdout/stderr and structured logs.
Run evaluation using the approved eval protocol.
Parse metrics, using scripts/rlxp_score.py internally when it fits the log format, and update .rlxp/report.md and .rlxp/ledger.jsonl.
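The append-only ledger update in the last step could look like the following sketch. The entry fields are hypothetical, since the ledger schema is not specified here; the only property taken from the text is that .rlxp/ledger.jsonl is append-only and machine-readable, so each run becomes one JSON line.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_ledger_entry(ledger: Path, entry: Dict[str, Any]) -> None:
    """Append one run as a single JSON line; existing lines are never rewritten."""
    ledger.parent.mkdir(parents=True, exist_ok=True)
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")


def read_ledger(ledger: Path) -> List[Dict[str, Any]]:
    """Load every recorded run, skipping blank lines."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A call such as `append_ledger_entry(Path(".rlxp/ledger.jsonl"), {"experiment_id": "exp-001", "decision": "inconclusive"})` adds a line without touching earlier runs; both field names in that call are illustrative.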

When subagents are available, use independent reviewers for metric validity and reward/curriculum/domain-randomization (DR) risk before accepting a run or launching a changed reward/curriculum configuration.

Decision rule

Classify each run as:

accept: primary score improves beyond the agreed threshold and guardrails pass.
reject: score regresses, guardrails fail, instability appears, or evidence shows reward hacking.
inconclusive: result is noisy, incomplete, or not comparable; propose replication or metric repair.
debug: command fails, training crashes, NaN appears, logging is broken, or evaluation is invalid.

Never accept a change solely because training reward increased.
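The four classes can be sketched as a single function. Everything about the `RunEvidence` structure is an assumption made for illustration, and so is the precedence between classes: the text does not say what wins when, for example, guardrails fail on a run that is also not comparable. This sketch also treats a gain that does not exceed the threshold as inconclusive. Training reward is deliberately not an input, so it cannot drive an accept on its own.

```python
from dataclasses import dataclass


@dataclass
class RunEvidence:
    run_valid: bool        # False on crash, NaN, broken logging, or invalid evaluation
    comparable: bool       # complete result, comparable with the baseline
    score_delta: float     # primary score minus baseline/current best (higher is better)
    guardrails_pass: bool
    reward_hacking: bool
    unstable: bool


def classify(ev: RunEvidence, threshold: float) -> str:
    """Map run evidence to accept / reject / inconclusive / debug."""
    if not ev.run_valid:
        return "debug"
    if not ev.guardrails_pass or ev.reward_hacking or ev.unstable:
        return "reject"
    if not ev.comparable:
        return "inconclusive"
    if ev.score_delta < 0:
        return "reject"
    if ev.score_delta > threshold:
        return "accept"
    return "inconclusive"  # gain too small to clear the agreed threshold
```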

Next-experiment policy

Choose the next experiment from evidence:

If baseline is unstable, fix launcher/config/logging before optimizing.
If reward terms are badly scaled but task metrics correlate with reward, tune reward parameters.
If task success is low because the policy never reaches useful states, adjust curriculum/adaptive sampling.
If sim-to-real robustness or scene diversity fails, adjust DR or scene sampling.
If metrics show a missing behavioral signal, propose reward engineering with static tests and rollout diagnostics.
If a change improves mean but hurts worst-bin performance, use targeted curriculum or worst-bin sampling.
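The policy above reads naturally as an ordered rule table. In the sketch below the diagnosis flag names are hypothetical, and treating list order as priority (first match wins) is an assumption; the text does not state how to resolve several conditions holding at once.

```python
from typing import Dict, List, Optional, Tuple

# (diagnosis flag, proposed action), in the order the policy lists them.
POLICY: List[Tuple[str, str]] = [
    ("baseline_unstable",
     "fix launcher/config/logging before optimizing"),
    ("reward_badly_scaled_but_correlated",
     "tune reward parameters"),
    ("never_reaches_useful_states",
     "adjust curriculum/adaptive sampling"),
    ("robustness_or_scene_diversity_fails",
     "adjust DR or scene sampling"),
    ("missing_behavioral_signal",
     "propose reward engineering with static tests and rollout diagnostics"),
    ("mean_up_worst_bin_down",
     "use targeted curriculum or worst-bin sampling"),
]


def next_experiment(diagnosis: Dict[str, bool]) -> Optional[str]:
    """Return the action for the first matching flag, or None if no rule applies."""
    for flag, action in POLICY:
        if diagnosis.get(flag, False):
            return action
    return None
```

Returning `None` when no flag is set keeps the loop from proposing an experiment without evidence behind it.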

Output

After each loop, produce:

# Experiment Update

## Run summary
## Metric comparison against baseline/current best
## Guardrail status
## Decision
## Evidence for decision
## Remaining budget
## Next proposed experiment
## Report updates
