From vanguard-frontier-agentic
Reviews day-2 operations of NVIDIA GPU fleets: DCGM exporter/diag posture, GPU telemetry into Prometheus/Grafana, MIG partitioning lifecycle, Xid error runbooks, fleet upgrades, and incident response for GPU-failure modes.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vanguard-frontier-agentic:nvidia-ai-operations-day2
This skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Review operational posture of NVIDIA GPU fleets against the NCP-AIO body of knowledge: DCGM and DCGM-Exporter coverage, MIG partitioning lifecycle, Xid error classification and runbooks, fleet driver/firmware upgrade orchestration, and GPU-aware incident response.
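As a rough illustration of the Xid classification step named above, the following Python sketch pulls the Xid code out of an NVRM kernel-log line and looks it up in a small triage table. It is not taken from the skill itself: the `TRIAGE` table, the `classify` helper, the action labels (`check-workload`, `drain-node`, `escalate`), and the sample log line are all hypothetical, and only the four Xid descriptions follow NVIDIA's published Xid reference.

```python
import re

# Matches the "NVRM: Xid (PCI:<bus>): <code>" prefix that the NVIDIA
# kernel driver writes to the kernel log.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+)")

# Hypothetical triage table. The descriptions follow NVIDIA's Xid reference;
# the action labels are assumptions made for this sketch, not vendor guidance.
TRIAGE = {
    13: ("Graphics Engine Exception", "check-workload"),
    31: ("GPU memory page fault", "check-workload"),
    48: ("Double Bit ECC Error", "drain-node"),
    79: ("GPU has fallen off the bus", "drain-node"),
}


def classify(line):
    """Return bus, Xid code, description, and assumed action, or None if no Xid is present."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    # Codes missing from the table are escalated rather than guessed at.
    name, action = TRIAGE.get(xid, ("unlisted Xid", "escalate"))
    return {"bus": match.group("bus"), "xid": xid, "name": name, "action": action}


# Made-up sample line in the usual NVRM format.
sample = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
print(classify(sample))
```

A real runbook would cover far more codes and would take its actions from the fleet's own policy; the point here is only that unlisted codes fall through to an explicit escalation path instead of being silently ignored.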
dcgmi diag, dcgmi health, nvidia-smi --query-gpu=..., Prometheus DCGM metrics) when the active client exposes it; otherwise fall back to NVIDIA DCGM documentation, sanitized dashboards, and Xid reference tables.
Return, at minimum:
npx claudepluginhub raishin/vanguard-frontier-agentic --plugin vanguard-frontier-agentic
Reviews NVIDIA GPU infrastructure deployments (DGX, HGX, MGX) against reference architectures, checking BMC segmentation, firmware, driver versions, ECC, persistence mode, and MIG configuration.
Sets up GPU monitoring for CoreWeave Kubernetes clusters using DCGM exporter metrics and Prometheus alerts for utilization, memory usage, temperature, and inference pod health.
Diagnoses per-node issues on AWS HyperPod clusters (EKS or Slurm): unhealthy, unresponsive, stuck nodes. Covers EFA, GPU hardware (XID, ECC, NVLink, DCGM), Slurm node state, disk/memory pressure, lifecycle scripts, SSM agent, container runtime, kernel panics, pod networking. Read-only triage with suggested remediation commands.