From coreweave-pack
Deploys KServe InferenceService on CoreWeave Kubernetes for GPU ML model serving with vLLM, autoscaling, scale-to-zero, and A100 affinity.
How this skill is triggered — by the user, by Claude, or both
Slash command
/coreweave-pack:coreweave-core-workflow-aThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
coreweave-install-auth setup# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-inference
annotations:
autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
autoscaling.knative.dev/metric: "concurrency"
autoscaling.knative.dev/target: "1"
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "5"
spec:
predictor:
minReplicas: 1
maxReplicas: 5
containers:
- name: kserve-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8080"
ports:
- containerPort: 8080
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
# For dev/staging -- scale down to zero when idle
metadata:
annotations:
autoscaling.knative.dev/minScale: "0" # Scale to zero
autoscaling.knative.dev/maxScale: "3"
autoscaling.knative.dev/scaleDownDelay: "5m"
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
-o jsonpath='{.status.url}')
curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
| Error | Cause | Solution |
|---|---|---|
| InferenceService not ready | GPU not available | Check node capacity and affinity |
| Scale-to-zero cold start | First request after idle | Set minScale: 1 for production |
| Model loading timeout | Large model download | Pre-cache model in PVC |
| OOMKilled | Model too large | Use multi-GPU or quantized model |
For GPU training workloads, see coreweave-core-workflow-b.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin coreweave-packDeploys AI inference services on CoreWeave Kubernetes using Helm charts and Kustomize for GPU scaling and multi-model setups.
Deploys vLLM OpenAI-compatible server to Kubernetes with GPU support, health probes, and services via YAML templates. Checks HF token secret and existing deployments before applying.
Deploys ML models to production serving infrastructure using MLflow, BentoML, or Seldon Core with REST/gRPC endpoints. Implements autoscaling, monitoring, and A/B testing for real-time inference.