From training-monitor
Prediction-first monitoring for ML/DL training jobs. Single-agent execution with reviewer sub-agent. Derives judgment criteria from training artifacts, not hardcoded rules.
How this skill is triggered — by the user, by Claude, or both
Slash command
/training-monitor:training-monitorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are monitoring **one training job**. Execute this entire procedure from start to finish.
You are monitoring one training job. Execute this entire procedure from start to finish.
All cross-session information is stored in files, not in context.
| Operation | Path |
|---|---|
| Read previous state | monitoring-logs/jobs/<job-id>.json |
| Write current state | monitoring-logs/jobs/<job-id>.json |
| Session logs | monitoring-logs/<timestamp>/ |
Job ID = training config path + model path (stable across restarts). PIDs are NOT stable identifiers.
Per-job state uses namespace isolation:
meta: job identifier, last updated, session countmonitor: derived criteria, status history, user guidancestrategy: decisions, hypotheses, outcomes, evaluate_afterBefore starting any work, create a task for EVERY step:
TaskCreate: "Step 1: Setup + Predictions"
TaskCreate: "Step 2: Collect Evidence"
TaskCreate: "Step 3: Compare Predictions vs Actuals"
TaskCreate: "Step 4: Analyze Metrics"
TaskCreate: "Step 5: Check Resources"
TaskCreate: "Step 6: Reviewer Audit"
TaskCreate: "Step 7: Troubleshoot (if needed)"
TaskCreate: "Step 8: Strategize"
TaskCreate: "Step 9: Write State"
Mark each task in_progress when starting, completed when the gate log is written.
TaskUpdate: Step 1 -> in_progress
monitoring-logs/<timestamp>/ (format: YYYY-MM-DD_HHMMSS).monitoring-logs/<timestamp>/1-predict.mdTaskUpdate: Step 1 -> completed
TaskUpdate: Step 2 -> in_progress
Run ALL evidence collection commands in parallel. Reference: steps/2-collect.md
Gate: write monitoring-logs/<timestamp>/2-collect.md
TaskUpdate: Step 2 -> completed
TaskUpdate: Step 3 -> in_progress
Build comparison table. Flag deviations that would change your health assessment. Reference: steps/3-compare.md
Gate: write monitoring-logs/<timestamp>/3-compare.md
TaskUpdate: Step 3 -> completed
TaskUpdate: Step 4 -> in_progress
Derive judgment criteria from the training's own artifacts. Assess health with articulated reasoning. Reference: steps/4-metrics.md
Load domain skills when the condition matches (MANDATORY -- loading required, following blindly not):
| Skill | When to load |
|---|---|
grpo-monitor | GRPO, PPO, or other RL algorithms |
distributed-monitor | Multiple GPUs or processes |
k8s-monitor | Kubernetes |
wandb-monitor | Weights & Biases logging |
Gate: write monitoring-logs/<timestamp>/4-metrics.md
TaskUpdate: Step 4 -> completed
TaskUpdate: Step 5 -> in_progress
Reference: steps/5-resources.md
Gate: write monitoring-logs/<timestamp>/5-resources.md
TaskUpdate: Step 5 -> completed
TaskUpdate: Step 6 -> in_progress
Spawn a sub-agent to adversarially review your work from Steps 1-5. The reviewer checks PROCESS, not domain content.
Send the sub-agent:
agents/quality-reviewer.mdIf REJECTED: revise the flagged issues and resubmit. Maximum 2 rounds.
Reference: steps/6-review.md
Gate: write monitoring-logs/<timestamp>/6-review.md
TaskUpdate: Step 6 -> completed
Skip if: status is HEALTHY, or WARNING with no specific anomalies. Trigger if: status is CRITICAL, OR specific anomalies or deviations were found in Steps 2-5.
TaskUpdate: Step 7 -> in_progress
Investigate the anomaly systematically: observe with numbers, reproduce, isolate root cause, propose concrete action. Reference: steps/7-troubleshoot.md
Gate: write monitoring-logs/<timestamp>/7-troubleshoot.md
TaskUpdate: Step 7 -> completed
TaskUpdate: Step 8 -> in_progress
Propose next-step hypotheses based on monitoring results. This step triggers on ALL statuses:
Present options to user via AskUserQuestion. After user choice, generate execution plan. Reference: steps/8-strategy.md
Gate: write monitoring-logs/<timestamp>/8-strategy.md
TaskUpdate: Step 8 -> completed
TaskUpdate: Step 9 -> in_progress
Update per-job state file (monitoring-logs/jobs/<job-id>.json):
meta: job identifier, last updated timestamp, session countmonitor: derived criteria, current status, status historystrategy: chosen hypothesis, execution plan, success/failure criteria, evaluate_after timestampGate: write monitoring-logs/<timestamp>/9-summary.md
TaskUpdate: Step 9 -> completed
| Status | What you must provide |
|---|---|
| HEALTHY | Key progress indicator, expected behavior, baseline, evidence of progress beyond baseline. Conclusion follows from evidence. |
| WARNING | Full process completed. "I looked and found no progress," not "I didn't look." |
| CRITICAL | Specific data showing failure (NaN, process dead, metric collapsed). |
| UNCERTAIN | Highest effort. What was tried, why it failed, what would resolve it. Propose a specific question to the user. |
npx claudepluginhub t2ance/training-monitor-pluginProvides UI/UX resources: 50+ styles, color palettes, font pairings, guidelines, charts for web/mobile across React, Next.js, Vue, Svelte, Tailwind, React Native, Flutter. Aids planning, building, reviewing interfaces.
Fetches up-to-date documentation from Context7 for libraries and frameworks like React, Next.js, Prisma. Use for setup questions, API references, and code examples.