Skill

coreweave-core-workflow-a

Deploys KServe InferenceService on CoreWeave Kubernetes for GPU ML model serving with vLLM, autoscaling, scale-to-zero, and A100 affinity.

Kubernetes

Hugging Face

ai-ml

devops

Popularity

Parent stars

2,203

Parent forks

296

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/coreweave-pack:coreweave-core-workflow-a

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read, Write, Edit, Bash(kubectl:*), Grep

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.

SKILL.md

130 lines · ~963 tokens

Stats

Language: Python

Parent stars: 2,203

Parent forks: 296

Maintenance: Good

Last Commit: Mar 22, 2026

CoreWeave Core Workflow: KServe Inference

Overview

Deploy production inference services on CoreWeave using a KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CoreWeave Kubernetes Service (CKS) integrates natively with KServe for serverless GPU inference.

Prerequisites

Completed coreweave-install-auth setup
KServe available on your CKS cluster
Model stored in S3, GCS, or Hugging Face
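
The manifest in Step 1 reads the Hugging Face token from a Secret named `hf-token` under the key `token`, so that Secret has to exist before the deployment. The following is a minimal sketch of generating it, assuming the token is supplied by the caller. The helper names `hf_token_secret` and `render` and the `default` namespace are illustrative and not part of the original skill.

```python
import json


def hf_token_secret(token: str, namespace: str = "default") -> dict:
    """Build a Secret manifest matching the secretKeyRef used in Step 1."""
    if not token:
        raise ValueError("token must not be empty")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        # The name and key must match `name: hf-token` / `key: token` in the manifest.
        "metadata": {"name": "hf-token", "namespace": namespace},
        "type": "Opaque",
        # stringData lets the API server take care of base64 encoding.
        "stringData": {"token": token},
    }


def render(manifest: dict) -> str:
    """Serialize to JSON, which kubectl accepts in place of YAML."""
    return json.dumps(manifest, indent=2)
```

Printing `render(hf_token_secret(token))` and piping the result to `kubectl apply -f -` would create the Secret. Reading the token from an environment variable rather than hard-coding it keeps it out of the script.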

Instructions

Step 1: Deploy an InferenceService

# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  annotations:
    autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "1"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"
          - "8080"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 48Gi
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: 32Gi
            cpu: "4"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["A100_PCIE_80GB"]

kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
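
The watch command above runs until it is interrupted, which is awkward in scripted deployments. A minimal sketch of a readiness check follows, assuming the InferenceService reports a `Ready` condition under `status.conditions`. The helper names `is_ready` and `fetch_isvc` and the `default` namespace are hypothetical.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True when the InferenceService JSON carries a Ready=True condition."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def fetch_isvc(name: str, namespace: str = "default") -> dict:
    """Read the InferenceService as JSON via kubectl (needs cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "inferenceservice", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling `is_ready(fetch_isvc("llama-inference"))` in a loop with a sleep between attempts gives a simple poll. Keeping the parsing separate from the kubectl call makes the check testable without a cluster.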

Step 2: Scale-to-Zero Configuration

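# For dev/staging -- scale down to zero when idle
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"    # Scale to zero
    autoscaling.knative.dev/maxScale: "3"
    autoscaling.knative.dev/scaleDownDelay: "5m"
spec:
  predictor:
    minReplicas: 0    # Keep in sync with minScale; the minReplicas: 1 from Step 1 may otherwise keep one pod running
    maxReplicas: 3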
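
The fragment above is an overlay on the Step 1 manifest rather than a complete file. One way to apply it to an already deployed service is a JSON merge patch. The sketch below builds such a patch. The function name `scale_to_zero_patch` is hypothetical, and setting `spec.predictor.minReplicas` and `maxReplicas` alongside the annotations is an assumption made here so that the replica bounds from Step 1 do not contradict them.

```python
import json


def scale_to_zero_patch(max_scale: int = 3, scale_down_delay: str = "5m") -> str:
    """Return a JSON merge patch carrying the dev/staging autoscaling settings."""
    patch = {
        "metadata": {
            "annotations": {
                "autoscaling.knative.dev/minScale": "0",
                "autoscaling.knative.dev/maxScale": str(max_scale),
                "autoscaling.knative.dev/scaleDownDelay": scale_down_delay,
            }
        },
        # Assumption: keep the predictor replica bounds in step with the annotations.
        "spec": {"predictor": {"minReplicas": 0, "maxReplicas": max_scale}},
    }
    return json.dumps(patch)
```

The resulting string can be passed to `kubectl patch inferenceservice llama-inference --type merge -p '<json>'`.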

Step 3: Test the Endpoint

# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
  -o jsonpath='{.status.url}')

curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
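
The same request can be issued from application code. The sketch below mirrors the curl call using only the Python standard library, assuming the endpoint returns the OpenAI-style chat completion shape (`choices[0].message.content`). The helper names `build_chat_request` and `extract_reply` are illustrative.

```python
import json
import urllib.request

MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(
    base_url: str, prompt: str, model: str = MODEL
) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_body["choices"][0]["message"]["content"]
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=300)` and feed `json.load(response)` to `extract_reply`. A generous timeout helps when the service has scaled to zero and the first request has to wait for a pod to start.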

Error Handling

Error	Cause	Solution
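InferenceService not ready	No GPU node satisfies the resource request or the node affinity	Check node capacity and the `gpu.nvidia.com/class` affinity value
Scale-to-zero cold start	First request after an idle period waits for a pod to start and load the model	Set `minScale: 1` for production
Model loading timeout	Large model download at startup	Pre-cache the model in a PVC
OOMKilled	Container exceeded its memory limit while loading the model	Raise the memory limit, or use multi-GPU or a quantized model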
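
Two rows of the table above can be told apart from the predictor pod's status. The sketch below classifies the output of `kubectl get pod <pod> -o json` using standard Kubernetes pod status fields. The function name `diagnose_pod` and the returned labels are illustrative, and the check covers only these two cases.

```python
def diagnose_pod(pod: dict) -> str:
    """Map a pod's status JSON to an error name from the table, if one applies."""
    status = pod.get("status", {})

    # OOMKilled appears as the termination reason of a current or previous container run.
    for cs in status.get("containerStatuses", []):
        for state_key in ("state", "lastState"):
            terminated = (cs.get(state_key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"

    # A pod that cannot be placed carries PodScheduled=False with reason Unschedulable,
    # for example when no node matches the GPU request or the node affinity.
    for cond in status.get("conditions", []):
        if (
            cond.get("type") == "PodScheduled"
            and cond.get("status") == "False"
            and cond.get("reason") == "Unschedulable"
        ):
            return "InferenceService not ready"

    return "unknown"
```

A result of `unknown` means neither pattern matched, in which case the pod events and container logs are the next place to look.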

Next Steps

For GPU training workloads, see coreweave-core-workflow-b.
