From coreweave-pack
Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors. Use when pods are stuck Pending, GPUs are not allocated, or experiencing CUDA and NCCL errors. Trigger with phrases like "coreweave error", "coreweave pod pending", "coreweave gpu not found", "coreweave debug", "fix coreweave".
How this skill is triggered — by the user, by Claude, or both
Slash command
/coreweave-pack:coreweave-common-errorsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
```bash
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
Fix: Check GPU availability: kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB. Try a different GPU type or region.
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
Fix: Create an imagePullSecret:
kubectl create secret docker-registry regcred \
--docker-server=ghcr.io \
--docker-username=$GH_USER \
--docker-password=$GH_TOKEN
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
Fix: Check storage class availability: kubectl get sc. Use CoreWeave storage classes like shared-hdd-ord1 or shared-ssd-ord1.
Fix: List valid GPU class labels:
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
Fix: Check Service and Endpoints:
kubectl get svc,endpoints <service-name>
For diagnostics, see coreweave-debug-bundle.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub flight505/skill-forge --plugin coreweave-pack