Skill

training-monitor

Prediction-first monitoring for ML/DL training jobs. Single-agent execution with reviewer sub-agent. Derives judgment criteria from training artifacts, not hardcoded rules.

Invocation

Slash command

/training-monitor:training-monitor

Supporting Files

steps/1-predict.md
steps/2-collect.md
steps/3-compare.md
steps/4-metrics.md
steps/5-resources.md
steps/6-review.md
steps/7-troubleshoot.md
steps/8-strategy.md

SKILL.md

199 lines · ~1.8k tokens

Stats

Parent stars: 0
Maintenance: Excellent
Last Commit: Apr 4, 2026

Training Monitor

You are monitoring one training job. Execute this entire procedure from start to finish.

Core Principles

Prediction-first: write predictions before reading logs, like writing tests before code.
Forced articulation: no status label without written reasoning that supports it.
Derived criteria: judge by the training's OWN artifacts (config, logged metrics, objective), not hardcoded thresholds.
"Process alive + GPU busy" is never evidence of progress.

State Protocol

All cross-session information is stored in files, not in context.

| Operation | Path |
| --- | --- |
| Read previous state | `monitoring-logs/jobs/<job-id>.json` |
| Write current state | `monitoring-logs/jobs/<job-id>.json` |
| Session logs | `monitoring-logs/<timestamp>/` |

Job ID = training config path + model path (stable across restarts). PIDs are NOT stable identifiers.
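One way to turn the two paths into a filename-safe job ID is sketched below. The hashing scheme, the readable prefix, and the function names `derive_job_id` and `state_path` are assumptions made for illustration; the skill only requires that the ID depend on the config path and model path and survive restarts.

```python
import hashlib
from pathlib import Path


def derive_job_id(config_path: str, model_path: str) -> str:
    """Return a filename-safe ID that depends only on the two paths (never on a PID)."""
    # Normalise to absolute paths so "./run.yaml" and "run.yaml" give the same ID.
    config = Path(config_path).expanduser().resolve()
    model = Path(model_path).expanduser().resolve()
    # A NUL separator keeps ("a", "bc") distinct from ("ab", "c").
    key = f"{config}\0{model}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:12]
    # The config stem is only a readability prefix (an illustrative choice).
    return f"{config.stem}-{digest}"


def state_path(job_id: str, root: str = "monitoring-logs") -> Path:
    """Location of the per-job state file under the layout in the table above."""
    return Path(root) / "jobs" / f"{job_id}.json"
```

Because the ID is a pure function of the two paths, a restarted job with a new PID maps to the same state file.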

Per-job state uses namespace isolation:

meta: job identifier, last updated, session count
monitor: derived criteria, status history, user guidance
strategy: decisions, hypotheses, outcomes, evaluate_after
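The three namespaces could be laid out in the state file roughly as follows. This is a minimal sketch: the key names inside each namespace (`job_id`, `last_updated`, `session_count`, and so on) are guesses derived from the descriptions above, not a schema the skill defines.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def empty_state(job_id: str) -> dict:
    """Skeleton for a first session; inner key names are assumptions."""
    return {
        "meta": {"job_id": job_id, "last_updated": None, "session_count": 0},
        "monitor": {"derived_criteria": [], "status_history": [], "user_guidance": []},
        "strategy": {"decisions": [], "hypotheses": [], "outcomes": [], "evaluate_after": None},
    }


def load_state(path: Path, job_id: str) -> dict:
    """Read the previous session's state, or start fresh if no file exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return empty_state(job_id)


def save_state(path: Path, state: dict) -> None:
    """Stamp the meta namespace and write the file via a temp file plus rename."""
    state["meta"]["last_updated"] = datetime.now(timezone.utc).isoformat()
    state["meta"]["session_count"] += 1
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)  # avoids leaving a half-written state file behind
```

Keeping each concern under its own top-level key means a step can rewrite `strategy` without touching what `monitor` recorded.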

Anti-Skip Protocol

Before starting any work, create a task for EVERY step:

TaskCreate: "Step 1: Setup + Predictions"
TaskCreate: "Step 2: Collect Evidence"
TaskCreate: "Step 3: Compare Predictions vs Actuals"
TaskCreate: "Step 4: Analyze Metrics"
TaskCreate: "Step 5: Check Resources"
TaskCreate: "Step 6: Reviewer Audit"
TaskCreate: "Step 7: Troubleshoot (if needed)"
TaskCreate: "Step 8: Strategize"
TaskCreate: "Step 9: Write State"

Mark each task in_progress when starting, completed when the gate log is written.

Procedure

Step 1: Setup + Predictions

TaskUpdate: Step 1 -> in_progress

Read per-job state file if it exists (previous session's criteria, history, decisions).
Create session log directory: monitoring-logs/<timestamp>/ (format: YYYY-MM-DD_HHMMSS).
Write predictions for this job before reading any training evidence (logs, GPU metrics, dashboards).
- If per-job state exists: base predictions on previous session's values.
- If first session: base predictions on training config and general knowledge.
- Reference: steps/1-predict.md
Gate: write monitoring-logs/<timestamp>/1-predict.md
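The session directory and gate file from this step might be created as in the sketch below. The helper names `create_session_dir` and `write_gate` are hypothetical, and the emptiness check is only a weak stand-in for the "substantive content" requirement.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def create_session_dir(root: str = "monitoring-logs", now: Optional[datetime] = None) -> Path:
    """Create monitoring-logs/<timestamp>/ using the YYYY-MM-DD_HHMMSS format."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H%M%S")
    session = Path(root) / stamp
    session.mkdir(parents=True, exist_ok=True)
    return session


def write_gate(session: Path, filename: str, body: str) -> Path:
    """Write a gate log, refusing blank content (a crude proxy for 'substantive')."""
    if not body.strip():
        raise ValueError(f"gate log {filename} must not be empty")
    gate = session / filename
    gate.write_text(body, encoding="utf-8")
    return gate
```

For example, `write_gate(session, "1-predict.md", predictions_text)` would be called before any log or GPU metric is read, where `predictions_text` is whatever was written for this job.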

TaskUpdate: Step 1 -> completed

Step 2: Collect Evidence

TaskUpdate: Step 2 -> in_progress

Run ALL evidence collection commands in parallel. Reference: steps/2-collect.md
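Running the collection commands in parallel could look like the sketch below. The actual commands live in steps/2-collect.md, which is not reproduced here, so the demo uses a placeholder command; `run_command` and `collect_all` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(argv: list[str], timeout: float = 30.0) -> dict:
    """Run one command and capture its output; failures are recorded, not raised."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return {"argv": argv, "returncode": proc.returncode,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hang is itself evidence worth logging.
        return {"argv": argv, "returncode": None, "stdout": "", "stderr": repr(exc)}


def collect_all(commands: dict[str, list[str]]) -> dict[str, dict]:
    """Start every command at once and gather results keyed by evidence name."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        futures = {name: pool.submit(run_command, argv) for name, argv in commands.items()}
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    # Placeholder command only; substitute the ones from steps/2-collect.md.
    demo = {"python_version": [sys.executable, "--version"]}
    for name, result in collect_all(demo).items():
        print(name, result["returncode"], result["stdout"].strip())
```

Threads are sufficient here because each worker spends its time waiting on a child process rather than computing.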

Gate: write monitoring-logs/<timestamp>/2-collect.md

TaskUpdate: Step 2 -> completed

Step 3: Compare Predictions vs Actuals

TaskUpdate: Step 3 -> in_progress

Build comparison table. Flag deviations that would change your health assessment. Reference: steps/3-compare.md
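A comparison table can be built mechanically once predictions are written as ranges. The sketch below assumes each prediction is a (low, high) interval taken from the Step 1 gate log; the `Prediction` class and the verdict labels are illustrative. It only flags out-of-range values; whether a flagged deviation changes the health assessment remains a judgment made with written reasoning.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    metric: str
    low: float   # lower bound written down in Step 1
    high: float  # upper bound written down in Step 1


def compare(predictions: list[Prediction], actuals: dict[str, float]) -> list[dict]:
    """Pair each prediction with its observed value and label the outcome."""
    rows = []
    for p in predictions:
        actual = actuals.get(p.metric)
        if actual is None:
            verdict = "MISSING"
        elif p.low <= actual <= p.high:
            verdict = "as predicted"
        else:
            # NaN fails both comparisons, so it lands here as a deviation.
            verdict = "DEVIATION"
        rows.append({"metric": p.metric, "predicted": f"{p.low:g} to {p.high:g}",
                     "actual": actual, "verdict": verdict})
    return rows


def to_markdown(rows: list[dict]) -> str:
    """Render the rows as a table suitable for the 3-compare.md gate log."""
    lines = ["| Metric | Predicted | Actual | Verdict |", "| --- | --- | --- | --- |"]
    for r in rows:
        actual = "n/a" if r["actual"] is None else f"{r['actual']:g}"
        lines.append(f"| {r['metric']} | {r['predicted']} | {actual} | {r['verdict']} |")
    return "\n".join(lines)
```

The ranges come from the job's own predictions rather than fixed thresholds, which keeps the comparison consistent with the derived-criteria principle.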

Gate: write monitoring-logs/<timestamp>/3-compare.md

TaskUpdate: Step 3 -> completed

Step 4: Analyze Metrics

TaskUpdate: Step 4 -> in_progress

Derive judgment criteria from the training's own artifacts. Assess health with articulated reasoning. Reference: steps/4-metrics.md

Load domain skills when the condition matches. Loading is MANDATORY; following their guidance blindly is not:

| Skill | When to load |
| --- | --- |
| `grpo-monitor` | GRPO, PPO, or other RL algorithms |
| `distributed-monitor` | Multiple GPUs or processes |
| `k8s-monitor` | Kubernetes |
| `wandb-monitor` | Weights & Biases logging |

Gate: write monitoring-logs/<timestamp>/4-metrics.md

TaskUpdate: Step 4 -> completed

Step 5: Check Resources

TaskUpdate: Step 5 -> in_progress

Reference: steps/5-resources.md

Gate: write monitoring-logs/<timestamp>/5-resources.md

TaskUpdate: Step 5 -> completed

Step 6: Reviewer Audit

TaskUpdate: Step 6 -> in_progress

Spawn a sub-agent to adversarially review your work from Steps 1-5. The reviewer checks PROCESS, not domain content.

Send the sub-agent:

Your gate logs from Steps 1-5
The reviewer checklist: read agents/quality-reviewer.md

If REJECTED: revise the flagged issues and resubmit. Maximum 2 rounds.

Reference: steps/6-review.md

Gate: write monitoring-logs/<timestamp>/6-review.md

TaskUpdate: Step 6 -> completed

Step 7: Troubleshoot (conditional)

Skip if: status is HEALTHY, or WARNING with no specific anomalies.
Trigger if: status is CRITICAL, or specific anomalies or deviations were found in Steps 2-5.
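The skip and trigger conditions can be read as a single predicate, sketched below. Two cases are not spelled out by the conditions themselves: a HEALTHY status with recorded anomalies, and UNCERTAIN. This sketch assumes that any recorded anomaly triggers the step regardless of status, which is one interpretation rather than a rule the skill states. `should_troubleshoot` is a hypothetical name.

```python
def should_troubleshoot(status: str, anomalies: list[str]) -> bool:
    """Decide whether Step 7 runs, given the status and anomalies from Steps 2-5."""
    if status == "CRITICAL":
        return True
    # Assumption: a specific anomaly or deviation triggers the step for every
    # other status, including the cases the skip/trigger text leaves open.
    return len(anomalies) > 0
```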

TaskUpdate: Step 7 -> in_progress

Investigate the anomaly systematically: observe with numbers, reproduce, isolate root cause, propose concrete action. Reference: steps/7-troubleshoot.md

Gate: write monitoring-logs/<timestamp>/7-troubleshoot.md

TaskUpdate: Step 7 -> completed

Step 8: Strategize

TaskUpdate: Step 8 -> in_progress

Propose next-step hypotheses based on monitoring results. This step triggers on ALL statuses:

HEALTHY: optional efficiency suggestions (lightweight, no full hypothesis structure required)
WARNING/CRITICAL/UNCERTAIN: 3 full hypotheses with falsifiable predictions

Present options to user via AskUserQuestion. After user choice, generate execution plan. Reference: steps/8-strategy.md

Gate: write monitoring-logs/<timestamp>/8-strategy.md

TaskUpdate: Step 8 -> completed

Step 9: Write State

TaskUpdate: Step 9 -> in_progress

Update per-job state file (monitoring-logs/jobs/<job-id>.json):

meta: job identifier, last updated timestamp, session count
monitor: derived criteria, current status, status history
strategy: chosen hypothesis, execution plan, success/failure criteria, evaluate_after timestamp

Gate: write monitoring-logs/<timestamp>/9-summary.md

TaskUpdate: Step 9 -> completed

Judgment Standard

| Status | What you must provide |
| --- | --- |
| HEALTHY | Key progress indicator, expected behavior, baseline, evidence of progress beyond baseline. Conclusion follows from evidence. |
| WARNING | Full process completed. "I looked and found no progress," not "I didn't look." |
| CRITICAL | Specific data showing failure (NaN, process dead, metric collapsed). |
| UNCERTAIN | Highest effort. What was tried, why it failed, what would resolve it. Propose a specific question to the user. |

Rules

NEVER read training evidence before writing predictions.
NEVER assign a status without written reasoning.
WARNING requires the full process. It means "I looked and found no progress."
Trust order when sources disagree: hardware metrics (nvidia-smi) > software metrics (log) > external dashboards.
Every step must write its gate log with substantive content before proceeding.
Do not use efficiency, speed, or brevity as justification for skipping any step.
If an anomaly is detected, investigate NOW. Do not defer.
Report ALL GPUs, not just the ones you expect to be busy.
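The trust rule above (hardware over software over external dashboards) can be applied mechanically when the same quantity is reported by several sources. The sketch below is illustrative only: the source labels and the `most_trusted` helper are assumptions, not part of the skill.

```python
# Highest-trust source first, mirroring the trust rule.
TRUST_ORDER = ("hardware", "software", "dashboard")


def most_trusted(readings: dict[str, float]) -> tuple[str, float]:
    """Return the reading from the most trusted source that reported a value."""
    for source in TRUST_ORDER:
        if source in readings:
            return source, readings[source]
    raise ValueError("no readings from a known source")
```

For instance, if a dashboard shows a GPU as busy while nvidia-smi reports it nearly idle, `most_trusted` returns the nvidia-smi figure, and the disagreement itself is an anomaly to investigate.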
