From training-monitor
Interactive setup wizard for the training-monitor plugin. Goal-driven — the agent determines which dependencies are needed by checking project context and asking the user only when the context is ambiguous. Installs missing dependencies and reports available capabilities.
Slash command: /training-monitor:monitor-doctor
Set up the training-monitor plugin for the user's environment. The goal is to determine which domain skills and external dependencies are needed, install what is missing, and report available capabilities.
Determine which entries in the Dependency Registry below apply to the user's situation. For each applicable entry, check if its dependencies are installed. Offer to install what is missing. Report the final capability state.
Do NOT follow a fixed questionnaire. Instead, look for the context signals below and ask the user only when they are ambiguous:
| Signal | What it tells you | How to check |
|---|---|---|
| RL/GRPO/PPO keywords in config or code | GRPO/RL training → need grpo-monitor | `grep -ri -e grpo -e ppo -e rloo -e reward -e kl_loss <project_dir>` |
| Multiple GPU processes, or torchrun/deepspeed in the launch command | Distributed training → need distributed-monitor | `ps aux \| grep -e torchrun -e deepspeed -e accelerate`; `nvidia-smi` showing multiple GPUs in use |
| K8s YAML files, kubectl available, namespace references | Kubernetes infra → need k8s-monitor | `ls *.yaml *k8s* 2>/dev/null`; `kubectl version --client 2>/dev/null` |
| wandb imported in code, wandb dir exists, WANDB_API_KEY set | W&B tracking → need wandb-monitor | `grep -ri -e "import wandb" -e "wandb.init" <project_dir>`; `ls wandb/ 2>/dev/null`; `echo $WANDB_API_KEY` |
| TensorBoard logs, SummaryWriter in code | TensorBoard tracking (no dedicated skill yet) | `grep -ri -e SummaryWriter -e tensorboard <project_dir>`; `ls runs/ tb_logs/ 2>/dev/null` |
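The file-based rows of the signal table can be approximated in a few lines of Python. This is a minimal sketch, not the plugin's implementation: `SIGNAL_PATTERNS`, `SIGNAL_TO_SKILL`, `detect_signals`, and the list of scanned file suffixes are hypothetical names and choices made for illustration. Only the keywords and the skill mapping come from the table. Process-based and Kubernetes checks are left out.

```python
import re
from pathlib import Path

# Keywords copied from the "How to check" column. Matching is as coarse as the
# grep commands: "ppo" also matches inside words such as "support".
SIGNAL_PATTERNS = {
    "rl": re.compile(r"grpo|ppo|rloo|reward|kl_loss", re.IGNORECASE),
    "wandb": re.compile(r"import wandb|wandb\.init", re.IGNORECASE),
    "tensorboard": re.compile(r"SummaryWriter|tensorboard", re.IGNORECASE),
}

# Signal -> domain skill, as in the table (TensorBoard has no dedicated skill yet).
SIGNAL_TO_SKILL = {
    "rl": "grpo-monitor",
    "wandb": "wandb-monitor",
    "tensorboard": None,
}

# Assumption: only these file types are worth scanning.
SCANNED_SUFFIXES = (".py", ".yaml", ".yml", ".json", ".sh", ".toml")


def detect_signals(project_dir):
    """Return the set of signal names whose keywords appear in project files."""
    found = set()
    for path in Path(project_dir).rglob("*"):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for signal, pattern in SIGNAL_PATTERNS.items():
            if pattern.search(text):
                found.add(signal)
    return found
```

A caller would map the result through `SIGNAL_TO_SKILL` and fall back to asking the user when the set is empty or the hits look like false positives.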
This is the complete list of domain skills and their external dependencies. The agent uses this registry to determine what to check and install.
| Skill | What it provides | External dependencies | Install commands |
|---|---|---|---|
| training-monitor | Core monitoring orchestrator | nvidia-smi | (pre-installed on GPU machines) |
| grpo-monitor | RL metrics, generation quality, phase time | none | n/a |
| distributed-monitor | NCCL diagnostics, process hierarchy, stragglers | none | n/a |
| k8s-monitor | Pod anomalies, scheduling escalation | kubectl with cluster access | `curl -LO https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && mv kubectl ~/.local/bin/` |
| wandb-monitor | Heartbeat stall detection, metric keys, health thresholds | wandb-primary skill + wandb Python package + authentication | `npx skills add wandb/skills`, then `pip install wandb`, then `wandb login` |
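One way to picture the registry is as a small data structure with a presence check per dependency. The sketch below is illustrative only: the registry lists dependencies and install commands but not how to check them, so `REGISTRY`, `has_command`, `has_python_package`, and `check_dependencies` are hypothetical. The checks are also deliberately shallow. Finding `kubectl` on the PATH does not prove cluster access, the `wandb-primary` skill is not checked at all, and authentication is inferred only from `WANDB_API_KEY`, without inspecting credentials stored by `wandb login`.

```python
import importlib.util
import os
import shutil


def has_command(name):
    """True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


def has_python_package(name):
    """True if a top-level Python package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


# Skill -> list of (dependency label, check function). Labels follow the
# registry table; the check functions are assumptions made for this sketch.
REGISTRY = {
    "training-monitor": [("nvidia-smi", lambda: has_command("nvidia-smi"))],
    "grpo-monitor": [],
    "distributed-monitor": [],
    "k8s-monitor": [("kubectl", lambda: has_command("kubectl"))],
    "wandb-monitor": [
        ("wandb package", lambda: has_python_package("wandb")),
        ("wandb auth", lambda: bool(os.environ.get("WANDB_API_KEY"))),
    ],
}


def check_dependencies(skills):
    """Map each dependency of the given skills to "OK" or "MISSING"."""
    status = {}
    for skill in skills:
        for label, check in REGISTRY[skill]:
            status[label] = "OK" if check() else "MISSING"
    return status
```

For example, `check_dependencies(["k8s-monitor", "wandb-monitor"])` returns a dict keyed by `kubectl`, `wandb package`, and `wandb auth`, which can feed the install prompt and the final report.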
Using the context signals and any information the user has provided, determine which rows in the Dependency Registry apply. If anything is ambiguous, ask the user using AskUserQuestion.
For each applicable row, check whether its external dependencies are installed. Run the check commands and collect results into three categories:
If there are missing dependencies, present them to the user via AskUserQuestion (multi-select). Each option should show the dependency name and its install command. The user selects which ones to install.
For each selected dependency:
For each skipped dependency:
Present the final state:
```
Monitor Doctor — Setup Complete
================================
Environment: [detected/stated training setup summary]
Skills:
[OK] training-monitor (core)
[OK/--] grpo-monitor
[OK/--] distributed-monitor
[OK/--] k8s-monitor
[OK/--] wandb-monitor
Dependencies:
[OK/MISSING/--] nvidia-smi
[OK/MISSING/--] kubectl
[OK/MISSING/--] wandb-primary skill
[OK/MISSING/--] wandb package
[OK/MISSING/--] wandb auth
Capabilities:
[OK/DEGRADED/OFF] [capability] — [reason if not OK]
...
================================
```
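The report layout above is regular enough to generate from three status maps. The following is a rough sketch under stated assumptions: `render_report` and its argument shapes are hypothetical, and the capability names in the usage note are invented placeholders, since the source does not enumerate capabilities.

```python
TITLE = "Monitor Doctor \u2014 Setup Complete"  # same title line as the template
RULE = "=" * 32


def render_report(environment, skills, dependencies, capabilities):
    """Build the report text.

    skills, dependencies: dict of name -> status ("OK", "MISSING", "--").
    capabilities: dict of name -> (status, reason); the reason is appended
    only when the status is not "OK".
    """
    lines = [TITLE, RULE, f"Environment: {environment}", "Skills:"]
    lines += [f"[{status}] {name}" for name, status in skills.items()]
    lines.append("Dependencies:")
    lines += [f"[{status}] {name}" for name, status in dependencies.items()]
    lines.append("Capabilities:")
    for name, (status, reason) in capabilities.items():
        suffix = "" if status == "OK" else f" \u2014 {reason}"
        lines.append(f"[{status}] {name}{suffix}")
    lines.append(RULE)
    return "\n".join(lines)
```

Calling it with, say, `{"pod anomaly detection": ("OFF", "kubectl missing")}` as the capabilities map yields a `[OFF]` line that carries the reason, while `OK` entries stay bare.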
If the project has a .claude/ directory, offer to save the environment profile:
```json
{
  "training_type": "grpo",
  "infrastructure": "k8s",
  "metric_trackers": ["wandb"],
  "skills_to_load": ["grpo-monitor", "k8s-monitor", "wandb-monitor"]
}
```
This allows training-monitor to automatically load the right domain skills in future sessions.
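Saving and reloading such a profile could look roughly like the sketch below. The source does not say which file inside `.claude/` holds the profile, so `PROFILE_NAME` is a hypothetical file name, and `save_profile` and `load_skills` are illustrative helpers rather than part of the plugin.

```python
import json
from pathlib import Path

# Hypothetical file name: the source only says the profile lives under .claude/.
PROFILE_NAME = "training-monitor-profile.json"


def save_profile(project_dir, profile):
    """Write the profile under .claude/ if that directory exists.

    Returns the written path, or None when the project has no .claude/
    directory (in which case nothing is created).
    """
    claude_dir = Path(project_dir) / ".claude"
    if not claude_dir.is_dir():
        return None
    path = claude_dir / PROFILE_NAME
    path.write_text(json.dumps(profile, indent=2) + "\n")
    return path


def load_skills(project_dir):
    """Return skills_to_load from a saved profile, or [] if none exists."""
    path = Path(project_dir) / ".claude" / PROFILE_NAME
    if not path.is_file():
        return []
    return json.loads(path.read_text()).get("skills_to_load", [])
```

Returning an empty list when no profile exists lets a later session fall back to context detection instead of failing.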
npx claudepluginhub t2ance/training-monitor-plugin