From training-monitor
Heuristics for monitoring GRPO, PPO, and other RL training. Common patterns, typical indicators, known failure modes. Reference knowledge, not rules.
Slash command: /training-monitor:grpo-monitor
Heuristics for monitoring RL training jobs (GRPO, PPO, RLOO, etc.). This skill provides **reference knowledge** about common patterns in RL training — not rules or checklists. Use it to inform your reasoning, not as a substitute for reasoning.
When used alongside training-monitor, the monitor agent derives its own judgment criteria from the training artifacts. This skill tells you what RL practitioners commonly look at and what failure modes are known — but the agent must judge whether these apply to the specific training at hand.
RL training adds a generation/rollout phase that standard training does not have.
A healthy GRPO step time breakdown:
| Phase | Expected % | If higher |
|---|---|---|
| Rollout generation | 5-15% | Increase inference throughput |
| Log prob computation (old + ref) | 20-40% | Increase micro_batch_size |
| Actor update (backward + optimizer) | 30-50% | This is the useful work |
| Checkpoint save | <10% | Reduce save_freq or use faster storage |
If any single phase is >60%, it is the bottleneck. Focus optimization there.
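The 60% rule can be checked mechanically once per-phase wall-clock times are available. The following is a minimal sketch, assuming the timings have already been extracted from the training logs into a dictionary. The phase names, the helper functions, and the example numbers are all hypothetical and do not come from any particular framework.

```python
from typing import Dict, Optional, Tuple


def phase_shares(phase_seconds: Dict[str, float]) -> Dict[str, float]:
    """Convert per-phase seconds into fractions of the total step time."""
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("total step time must be positive")
    return {name: secs / total for name, secs in phase_seconds.items()}


def find_bottleneck(
    phase_seconds: Dict[str, float], threshold: float = 0.60
) -> Optional[Tuple[str, float]]:
    """Return (phase, share) if the largest phase exceeds the threshold, else None."""
    shares = phase_shares(phase_seconds)
    name = max(shares, key=shares.get)
    share = shares[name]
    return (name, share) if share > threshold else None


# Hypothetical timings (seconds) for one training step.
step = {
    "rollout_generation": 130.0,
    "log_prob": 30.0,
    "actor_update": 35.0,
    "checkpoint_save": 5.0,
}
bottleneck = find_bottleneck(step)
balanced = find_bottleneck({"log_prob": 50.0, "actor_update": 50.0})
```

With these made-up numbers, rollout generation takes 130 of 200 seconds, so it is reported as the bottleneck. The balanced example returns None because no phase exceeds the threshold.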
In addition to general metric anomalies (NaN, loss=0, gradient explosion):
Detect: response_length suddenly drops to minimum, OR response_length hits max_response_length for >50% of samples (clip_ratio > 0.5), OR reward becomes constant (always 0 or always 1).
Root causes:
Action: Check actual generated text (not just metrics). Try these methods in order to find generated samples:
log_completions, OpenRLHF's sample saving)
If no generated text is available through any method, report this as a monitoring limitation and recommend the user enable sample logging for future sessions.
Is the model producing coherent, diverse outputs? If collapsed, this likely requires a config change (temperature, KL coef, reward function) — use team-decision-review before changing.
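The detection conditions above can be expressed as a small check over one batch of samples. This is a sketch under stated assumptions: the function name, its arguments, and the default minimum length are hypothetical, and "drops to minimum" is approximated as every response in the batch being at or below that minimum. A flag raised here is only a prompt to go read the generated text, not a diagnosis.

```python
from typing import Dict, Sequence


def collapse_flags(
    response_lengths: Sequence[int],
    rewards: Sequence[float],
    max_response_length: int,
    min_response_length: int = 1,  # assumed floor; set to the job's actual minimum
    clip_threshold: float = 0.5,
) -> Dict[str, bool]:
    """Flag the three collapse indicators for one batch of samples."""
    if not response_lengths or not rewards:
        raise ValueError("need at least one sample")
    # Fraction of samples that ran into the length limit.
    clipped = sum(1 for n in response_lengths if n >= max_response_length)
    clip_ratio = clipped / len(response_lengths)
    return {
        "length_at_minimum": all(n <= min_response_length for n in response_lengths),
        "length_clipped": clip_ratio > clip_threshold,
        "reward_constant": len(set(rewards)) == 1,
    }


# Hypothetical batch: three of four responses hit the limit, all rewards are 0.
flags = collapse_flags(
    response_lengths=[512, 512, 512, 200],
    rewards=[0.0, 0.0, 0.0, 0.0],
    max_response_length=512,
)
```

In this example, length_clipped and reward_constant are both set, while length_at_minimum is not.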
In addition to general overfitting actions (early stopping, regularization, reduce LR):
Detect: container RAM usage spikes during checkpoint merge+upload, training step time increases during merge, merge process fails with OOM.
Action:
ps aux | grep -E "model_merger|main_ppo".
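The same check can be scripted so that it also reports how much memory each matching process holds. The sketch below parses the text output of ps aux and assumes the usual 11-column layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), with RSS in KiB as on Linux. The helper names and the sample output are hypothetical.

```python
import re
from typing import List, Tuple

# Same pattern as the grep command above.
PATTERN = re.compile(r"model_merger|main_ppo")


def matching_processes(ps_output: str) -> List[Tuple[int, int, str]]:
    """Return (pid, rss_kib, command) for each `ps aux` row matching PATTERN."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND is the 11th column and may contain spaces
        if len(fields) < 11:
            continue
        command = fields[10]
        if PATTERN.search(command):
            matches.append((int(fields[1]), int(fields[5]), command))
    return matches


def merge_overlaps_training(procs: List[Tuple[int, int, str]]) -> bool:
    """True if a merge process and a training process appear in the same snapshot."""
    commands = [cmd for _, _, cmd in procs]
    return any("model_merger" in c for c in commands) and any(
        "main_ppo" in c for c in commands
    )


# Made-up sample output for illustration only.
sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 101 95.0 40.0 9000000 8000000 ? Rl 10:00 5:00 python main_ppo.py
root 202 50.0 30.0 7000000 6000000 ? Rl 10:05 1:00 python model_merger.py merge
root 303 0.0 0.1 20000 5000 ? Ss 09:00 0:00 sshd
"""
procs = matching_processes(sample)
overlap = merge_overlaps_training(procs)
```

For a live snapshot, the input string can come from subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. If both processes appear in one snapshot, the merge is competing with training for container RAM, which matches the symptoms listed under Detect.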