From training-monitor
Heuristics for monitoring training jobs on Kubernetes. Common patterns, pod anomalies, scheduling failures, escalation ladder. Reference knowledge, not rules.
Slash command
/training-monitor:k8s-monitor
Heuristics for monitoring training jobs on Kubernetes. This skill provides **reference knowledge** about common K8s patterns and failure modes — not rules or checklists. Use it to inform your reasoning about what to check in K8s environments.
# Pod status
kubectl get pods -n <ns> -l job-name=<name> -o wide
# Recent logs
kubectl logs -n <ns> -l job-name=<name> --tail=20
# Error scan
kubectl logs -n <ns> -l job-name=<name> --tail=200 2>/dev/null | grep -iE "error|exception|killed|oom|timeout|\bnan\b|\binf\b" | tail -5
# Disk usage
kubectl exec <pod> -n <ns> -- df -h /mount/path
Detect: pod not Running, RESTARTS > 0, pod evicted/preempted, NCCL errors, OOM-killed, node not ready, pod stuck in Pending/ContainerCreating >5 min.
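The first two Detect criteria (pod not Running, RESTARTS > 0) can be checked mechanically. The following is a minimal sketch, not part of the plugin: the function name `flag_unhealthy_pods` is hypothetical, and it assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, ...). Treating `Completed` as healthy is also an assumption that suits finished Job pods.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints every pod that is not Running/Completed or has restarted.
flag_unhealthy_pods() {
  # $1 = NAME, $3 = STATUS, $4 = RESTARTS (may be followed by "(3m ago)")
  awk '($3 != "Running" && $3 != "Completed") || ($4 + 0) > 0 {
    printf "%s status=%s restarts=%s\n", $1, $3, $4
  }'
}
```

Possible usage: `kubectl get pods -n <ns> -l job-name=<name> --no-headers | flag_unhealthy_pods`. Empty output means neither criterion fired; the remaining criteria (NCCL errors, OOM kills, node readiness) still need the log and event checks.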
Action: run kubectl describe pod <name> -n <ns> | tail -30 and read the events.
If a pod stays Pending, the cluster cannot satisfy your resource request. Do NOT wait indefinitely. Diagnose and adapt:
kubectl describe pod <name> -n <ns> | grep -A3 "Events\|FailedScheduling"
Common reasons:
Try in order, stop when it schedules:
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
After each adaptation, resubmit and set a 5-minute timer. If the pod is still Pending, try the next rung on the ladder.
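The resubmit-and-wait step could be sketched as a small polling helper. This is an illustration under assumptions, not the plugin's own tooling: the function name `wait_while_pending`, its argument order, and the 15-second poll interval are hypothetical. Only the 5-minute limit and the `job-name` label selector come from the text above. Note that the Pending phase also covers time spent pulling images, so a slow image pull looks the same as an unschedulable pod here.

```shell
# Hypothetical helper: poll a job's pods until none is Pending or the timer expires.
# Returns 0 once no pod is Pending, 1 if the timeout elapses first.
wait_while_pending() {
  local job=$1 ns=$2
  local timeout=${3:-300}   # seconds; 300 matches the 5-minute timer
  local interval=${4:-15}   # assumed poll interval
  local elapsed=0 phases
  while [ "$elapsed" -lt "$timeout" ]; do
    # Space-separated phases of all pods carrying the job-name label
    phases=$(kubectl get pods -n "$ns" -l "job-name=$job" \
      -o jsonpath='{.items[*].status.phase}')
    case " $phases " in
      "  ") ;;               # no pods listed yet, keep waiting
      *" Pending "*) ;;      # at least one pod is still Pending
      *) return 0 ;;         # every pod has left the Pending phase
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

Possible usage after each resubmission: `wait_while_pending <name> <ns> || echo "still Pending, try the next rung"`.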
npx claudepluginhub t2ance/training-monitor-plugin