From training-monitor
Heuristics for monitoring multi-GPU and multi-process distributed training. Common patterns, NCCL diagnostics, known failure modes. Reference knowledge, not rules.
How this skill is triggered — by the user, by Claude, or both
Slash command
/training-monitor:distributed-monitorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Heuristics for monitoring distributed training jobs (DDP, FSDP, DeepSpeed, Megatron, Ray, etc.). This skill provides **reference knowledge** about common distributed training patterns — not rules or checklists. Use it to inform your reasoning about what to check and what failure modes are known.
Heuristics for monitoring distributed training jobs (DDP, FSDP, DeepSpeed, Megatron, Ray, etc.). This skill provides reference knowledge about common distributed training patterns — not rules or checklists. Use it to inform your reasoning about what to check and what failure modes are known.
The launcher process (e.g., torchrun, Ray driver, deepspeed launcher) may be dead while worker processes continue running. Always check ALL process PIDs, not just the launcher.
# Find all related training processes
ps aux | grep <training_script_name>
# Check each worker PID
ps -p <PID1>,<PID2>,... -o pid,etime,state --no-headers
In addition to single-process stall detection:
# Check if NCCL env vars are set
env | grep NCCL
# Check for NCCL errors in logs
grep -i "nccl\|timeout\|watchdog" <log_file>
Common NCCL issues:
npx claudepluginhub t2ance/training-monitor-pluginProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.