From coreweave-pack
Sets up GPU monitoring for CoreWeave Kubernetes clusters using DCGM exporter metrics and Prometheus alerts for utilization, memory usage, temperature, and inference pod health.
Slash command: /coreweave-pack:coreweave-observability
CKS clusters come with DCGM exporter pre-installed. Key metrics:
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU memory used (MB) |
| DCGM_FI_DEV_FB_FREE | GPU memory free (MB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
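The metrics above can be combined into derived series before any alerting is layered on top. The following is a minimal sketch of a Prometheus recording-rule group, not something shipped with the skill. The group name, the recorded series names, and the `Hostname` label are illustrative assumptions; check which labels your DCGM exporter deployment actually attaches before relying on them.

```yaml
groups:
  - name: coreweave-gpu-recording   # hypothetical group name
    rules:
      # Fraction of framebuffer memory in use per GPU (0.0 to 1.0),
      # computed from the used and free gauges in the table above.
      - record: gpu:memory_used:ratio
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)
      # Mean GPU utilization per node. Assumes the exporter sets a
      # Hostname label on each sample; adjust to your label set.
      - record: node:gpu_utilization:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```

Precomputing the memory ratio keeps alert expressions short and lets dashboards and alerts share one definition of "memory used".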
groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"
      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
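The skill description also mentions temperature alerts, which the rules above do not cover. As a hedged sketch of what such a rule might look like, the group below follows the same shape as the existing alerts. The alert name, group name, the 85 threshold, and the 5m window are assumptions chosen for illustration, not values from the pack; safe operating temperatures vary by GPU model.

```yaml
groups:
  - name: coreweave-gpu-thermal   # hypothetical group name
    rules:
      - alert: GPUTemperatureHigh
        # 85 is an assumed threshold in degrees C; tune it per GPU model.
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "GPU temperature above 85C for 5min -- check cooling and throttling"
```

Unlike `GPUUtilizationLow`, this expression is not wrapped in `avg()`, so it fires per GPU series and does not let one hot device hide behind a cool fleet average.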
For incident response, see coreweave-incident-runbook.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-pack