Skill

monitor-doctor

Interactive setup wizard for the training-monitor plugin. Goal-driven — the agent determines which dependencies are needed by checking project context and asking the user only when the context is ambiguous. Installs missing dependencies and reports available capabilities.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/training-monitor:monitor-doctor

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Set up the training-monitor plugin for the user's environment. The goal is to determine which domain skills and external dependencies are needed, install what is missing, and report available capabilities.

SKILL.md

114 lines · ~1.4k tokens

Stats

Parent stars0

MaintenanceExcellent

Last CommitApr 4, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Monitor Doctor

Goal

Determine which entries in the Dependency Registry below apply to the user's situation. For each applicable entry, check if its dependencies are installed. Offer to install what is missing. Report the final capability state.

Strategy

Do NOT follow a fixed questionnaire. Instead:

Check context first. Read the project directory, look at running processes, check installed packages. Many answers are already available without asking.
Ask only when ambiguous. If the context makes the answer clear, skip the question. If the user already stated their setup (e.g., "I'm running GRPO on K8s with W&B"), extract the answers directly — do not re-ask.
Ask efficiently. If multiple things are unclear, batch them into a single AskUserQuestion call. Do not ask one question at a time if you can ask several at once.
Be adaptive. If the user's first message already provides enough information to determine all dependencies, go straight to checking and installing. Zero questions is a valid outcome.

Context Signals to Check

Signal	What it tells you	How to check
RL/GRPO/PPO keywords in config or code	GRPO/RL training → need `grpo-monitor`	`grep -ri "grpo\|ppo\|rloo\|reward\|kl_loss" <project_dir>`
Multiple GPU processes or torchrun/deepspeed in command	Distributed training → need `distributed-monitor`	`ps aux \| grep -E "torchrun\|deepspeed\|accelerate"`, `nvidia-smi` showing multiple GPUs used
K8s YAML files, kubectl available, namespace references	Kubernetes infra → need `k8s-monitor`	`ls .yaml k8s* 2>/dev/null`, `kubectl version --client 2>/dev/null`
wandb imported in code, wandb dir exists, WANDB_API_KEY set	W&B tracking → need `wandb-monitor`	`grep -ri "import wandb\|wandb.init" <project_dir>`, `ls wandb/ 2>/dev/null`, `echo $WANDB_API_KEY`
tensorboard logs, SummaryWriter in code	TensorBoard tracking (no dedicated skill yet)	`grep -ri "SummaryWriter\|tensorboard" <project_dir>`, `ls runs/ tb_logs/ 2>/dev/null`

Dependency Registry

This is the complete list of domain skills and their external dependencies. The agent uses this registry to determine what to check and install.

Skill	What it provides	External dependencies	Install commands
`training-monitor`	Core monitoring orchestrator	`nvidia-smi`	(pre-installed on GPU machines)
`grpo-monitor`	RL metrics, generation quality, phase time	none	—
`distributed-monitor`	NCCL diagnostics, process hierarchy, stragglers	none	—
`k8s-monitor`	Pod anomalies, scheduling escalation	`kubectl` with cluster access	`curl -LO https://dl.k8s.io/release/$(curl -sL https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && mv kubectl ~/.local/bin/`
`wandb-monitor`	Heartbeat stall detection, metric keys, health thresholds	`wandb-primary` skill + `wandb` Python package + authentication	`npx skills add wandb/skills` then `pip install wandb` then `wandb login`

Procedure

1. Determine Applicable Skills

Using the context signals and any information the user has provided, determine which rows in the Dependency Registry apply. If anything is ambiguous, ask the user using AskUserQuestion.

2. Check Dependencies

For each applicable row, check whether its external dependencies are installed. Run the check commands and collect results into three categories:

Installed: dependency is present and working
Missing: dependency is absent but needed
Not needed: dependency is for a skill that does not apply

3. Offer Installation

If there are missing dependencies, present them to the user via AskUserQuestion (multi-select). Each option should show the dependency name and its install command. The user selects which ones to install.

For each selected dependency:

Execute the install command
Verify it succeeded
If it fails, report the error and continue

For each skipped dependency:

Mark the corresponding capability as DEGRADED
Explain what the user will miss

4. Report Capabilities

Present the final state:

Monitor Doctor — Setup Complete
================================
Environment: [detected/stated training setup summary]

Skills:
  [OK]  training-monitor (core)
  [OK/--]  grpo-monitor
  [OK/--]  distributed-monitor
  [OK/--]  k8s-monitor
  [OK/--]  wandb-monitor

Dependencies:
  [OK/MISSING/--] nvidia-smi
  [OK/MISSING/--] kubectl
  [OK/MISSING/--] wandb-primary skill
  [OK/MISSING/--] wandb package
  [OK/MISSING/--] wandb auth

Capabilities:
  [OK/DEGRADED/OFF] [capability] — [reason if not OK]
  ...
================================

5. Save Profile (optional)

If the project has a .claude/ directory, offer to save the environment profile:

{
  "training_type": "grpo",
  "infrastructure": "k8s",
  "metric_trackers": ["wandb"],
  "skills_to_load": ["grpo-monitor", "k8s-monitor", "wandb-monitor"]
}

This allows training-monitor to automatically load the right domain skills in future sessions.

monitor-doctor

Invocation

Context Preview

SKILL.md

monitor-doctor

Invocation

Context Preview

SKILL.md

Monitor Doctor

Goal

Strategy

Context Signals to Check

Dependency Registry

Procedure

1. Determine Applicable Skills

2. Check Dependencies

3. Offer Installation

4. Report Capabilities

5. Save Profile (optional)

Similar Skills

Monitor Doctor

Goal

Strategy

Context Signals to Check

Dependency Registry

Procedure

1. Determine Applicable Skills

2. Check Dependencies

3. Offer Installation

4. Report Capabilities

5. Save Profile (optional)

Similar Skills