From azure
Walks through setting up AI Runway on an existing AKS cluster: cluster verification, controller install, GPU assessment, provider setup, and first model deployment.
How this skill is triggered — by the user, by Claude, or both
Slash command
/azure:airunway-aks-setup [skip-to-step N][skip-to-step N]The summary Claude sees in its skill listing — used to decide when to auto-load this skill
This skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides `skip-to-step N` to resume from a specific phase.
references/gpu-profiles.mdreferences/model-sizing.mdreferences/powershell-notes.mdreferences/steps/step-1-verify.mdreferences/steps/step-2-controller.mdreferences/steps/step-3-gpu.mdreferences/steps/step-4-provider.mdreferences/steps/step-5-deploy.mdreferences/steps/step-6-summary.mdreferences/troubleshooting.mdThis skill walks users from a bare Kubernetes cluster to a running AI model deployment. Follow each step in sequence unless the user provides skip-to-step N to resume from a specific phase.
Cost awareness: GPU node pools incur significant compute charges (A100-80GB can cost $3–5+/hr). Confirm the user understands cost implications before provisioning GPU resources.
This skill assumes an AKS cluster already exists. If the user does not have a cluster, hand off to the azure-kubernetes skill first to provision one (with a GPU node pool unless CPU-only inference is acceptable), then return here.
| Property | Value |
|---|---|
| Best for | End-to-end AI Runway onboarding on AKS |
| CLI tools | kubectl, make, curl |
| MCP tools | None |
| Related skills | azure-kubernetes (cluster setup), azure-diagnostics (troubleshooting) |
Use this skill when the user wants to:
This skill uses no MCP tools. All cluster operations are performed directly via kubectl and make.
skip-to-step N, start at step N; assume prior steps are complete| # | Step | Reference |
|---|---|---|
| 1 | Cluster Verification — context check, node inventory, GPU detection | step-1-verify.md |
| 2 | Controller Installation — CRD + controller deployment | step-2-controller.md |
| 3 | GPU Assessment — detect GPU models, flag dtype/attention constraints | step-3-gpu.md |
| 4 | Provider Setup — recommend and install inference provider | step-4-provider.md |
| 5 | First Deployment — pick a model, deploy, verify Ready | step-5-deploy.md |
| 6 | Summary — recap, smoke test, next steps | step-6-summary.md |
| Error / Symptom | Likely Cause | Remediation |
|---|---|---|
| No kubeconfig context | Not connected to a cluster | Run az aks get-credentials or equivalent |
| Controller in CrashLoopBackOff | Config or RBAC issue | kubectl logs -n airunway-system -l control-plane=controller-manager --previous |
| Provider not ready | Image pull or RBAC issue | kubectl logs <pod-name> -n <namespace> for the provider pod |
| ModelDeployment stuck in Pending | GPU scheduling failure or provider not ready | kubectl describe modeldeployment <name> -n <namespace> events |
bfloat16 errors at inference | T4 or V100 lacks bfloat16 support | Add --dtype float16 to serving args |
For full error handling and rollback procedures, see troubleshooting.md.
npx claudepluginhub anthropics/claude-plugins-official --plugin azureSets up, scales, validates, and recovers NVIDIA Physical AI infrastructure for synthetic data generation on MicroK8s or Azure AKS, including Kubernetes, inference endpoints, and OSMO deployment.
Plans and configures production-ready AKS clusters covering Day-0 decisions, SKU selection, networking, security, and operations like autoscaling and upgrades.
Provides expert guidance for Azure Kubernetes Service (AKS) covering troubleshooting, best practices, architecture, security, deployment, and multi-cluster or service mesh scenarios.