From castai-pack
Diagnoses an offline CAST AI agent, API errors, and autoscaling failures in Kubernetes using kubectl and curl. Use it when nodes are not scaling or the agent is crashing.
Slash command: /castai-pack:castai-common-errors
Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.
Prerequisites: kubectl access to the cluster, CASTAI_API_KEY configured.
kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50
Causes and fixes:
Set --set provider=eks|gke|aks correctly in Helm
Make sure outbound traffic to api.cast.ai is allowed
# Check agent heartbeat
kubectl logs -n castai-agent deployment/castai-agent | grep -i "heartbeat\|connect\|error"
# Verify network connectivity from inside the cluster
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters
Fix: Restart the agent pod: kubectl rollout restart deployment/castai-agent -n castai-agent
# Test API key
curl -s -o /dev/null -w "%{http_code}" \
-H "X-API-Key: ${CASTAI_API_KEY}" \
https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401
Fix: Generate a new API key at console.cast.ai > API > API Access Keys.
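Both curl probes above print only a bare HTTP status code, so it may help to turn that code into a next step. The sketch below is one way to do that. The function name `classify_status` and the message wording are illustrative, not CAST AI output. It relies on curl printing `000` for `%{http_code}` when no HTTP response was received at all.

```shell
# Map the status code printed by curl -w "%{http_code}" to a next step.
# Messages are illustrative wording, not output from CAST AI.
classify_status() {
  case "$1" in
    200) echo "ok: API reachable and key accepted" ;;
    401) echo "auth: key rejected, generate a new API key" ;;
    000) echo "network: no HTTP response, check egress to api.cast.ai" ;;
    *)   echo "other: unexpected status $1, inspect the response body" ;;
  esac
}

# Usage (needs network access and CASTAI_API_KEY):
# code=$(curl -s -o /dev/null -w "%{http_code}" \
#   -H "X-API-Key: ${CASTAI_API_KEY}" \
#   https://api.cast.ai/v1/kubernetes/external-clusters)
# classify_status "$code"
```

A 401 from inside the cluster still proves the network path works, so only `000` points at firewall or egress rules.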
# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# Verify unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.unschedulablePods'
Causes:
unschedulablePods.enabled is false -- enable it
clusterLimits.cpu.maxCores has been reached -- raise the limit
# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.nodeDownscaler'
Causes:
nodeDownscaler.enabled is false
PodDisruptionBudget blocking eviction
emptyNodes.delaySeconds is set too high
# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
"https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
| jq '.spotInstances'
Fix: Set spotDiversityEnabled to true and spotDiversityPriceIncreaseLimitPercent to 20-30 for better availability.
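The three policy checks above all read the same endpoint, so the document can be fetched once and the two `enabled` flags named in the causes lists checked together. This is a minimal sketch. The helper names `policy_verdict` and `report_policies` are hypothetical, and it assumes `jq` is installed and that the flags sit at `.unschedulablePods.enabled` and `.nodeDownscaler.enabled` as the text suggests.

```shell
# Print one verdict line per policy flag.
# $1 = policy name, $2 = value of its .enabled field ("true", "false" or "null").
policy_verdict() {
  if [ "$2" = "true" ]; then
    echo "$1: enabled"
  else
    echo "$1: NOT enabled (got: $2)"
  fi
}

# Fetch the policies document once and check both flags.
# Requires curl, jq, CASTAI_API_KEY and CASTAI_CLUSTER_ID.
report_policies() {
  policies=$(curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
    "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies") || return 1
  for name in unschedulablePods nodeDownscaler; do
    value=$(printf '%s' "$policies" | jq -r ".${name}.enabled")
    policy_verdict "$name" "$value"
  done
}
```

Running `report_policies` prints one line per policy. A `null` value usually means the field path did not match the response, which is worth checking against the raw JSON.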
Symptoms: Pods being evicted too frequently, service disruption.
kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20
Fix: Increase evictor cycle interval or switch to non-aggressive mode:
helm upgrade castai-evictor castai-helm/castai-evictor \
-n castai-agent \
--set castai.apiKey="${CASTAI_API_KEY}" \
--set castai.clusterID="${CASTAI_CLUSTER_ID}" \
--set evictor.aggressiveMode=false \
--set evictor.cycleInterval=600
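Before or after changing the evictor settings, it can be useful to see which namespaces absorb the most evictions. The sketch below tallies eviction events per namespace. The function name `count_by_namespace` is hypothetical, and the kubectl invocation in the usage comment uses the standard `custom-columns` and `--no-headers` output options.

```shell
# Read namespace names on stdin (one per eviction event) and print
# "<namespace> <count>", highest count first.
count_by_namespace() {
  sort | uniq -c | sort -rn | awk '{print $2, $1}'
}

# Usage (needs cluster access):
# kubectl get events -A --field-selector reason=Evicted \
#   -o custom-columns=NS:.metadata.namespace --no-headers | count_by_namespace
```

Comparing the tally before and after the Helm change gives a rough signal of whether the new interval reduced disruption.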
terraform plan -var-file=environments/prod.tfvars
# If drift detected, reconcile state (terraform refresh is deprecated in favor of -refresh-only):
terraform apply -refresh-only -var-file=environments/prod.tfvars
Fix: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.
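Drift detection can also be scripted, since `terraform plan -detailed-exitcode` exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below translates that exit code. The function name `drift_verdict` is hypothetical, and exit code 2 only says the plan is non-empty, which may be drift or may be an unapplied config change.

```shell
# Translate the exit code of `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = changes present (possible drift).
drift_verdict() {
  case "$1" in
    0) echo "in sync" ;;
    2) echo "drift detected" ;;
    *) echo "plan failed" ;;
  esac
}

# Usage (needs an initialized Terraform working directory):
# terraform plan -detailed-exitcode -var-file=environments/prod.tfvars >/dev/null
# drift_verdict "$?"
```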
# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10
# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
--reuse-values
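To decide whether the upgrade above is needed, the installed chart version can be compared with the newest one from `helm search repo`. The sketch below is a small version comparison. The function name `is_older` is hypothetical, the version numbers in the usage comment are made up, and it assumes `sort -V` (version sort, as in GNU coreutils) is available.

```shell
# Succeeds (exit 0) when $1 is an older version than $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
is_older() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage with hypothetical version numbers:
# is_older "0.50.0" "0.51.2" && echo "upgrade available"
```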
kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50
Causes:
Workload is missing autoscaling.cast.ai/enabled: "true"
For comprehensive diagnostics, see castai-debug-bundle.
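For the workload autoscaler cause listed above, a quick way to find workloads that have not opted in is to list the `autoscaling.cast.ai/enabled` value per Deployment and filter out the ones set to `true`. This is a sketch under one explicit assumption: that the key is an annotation on the Deployment's own metadata. The source does not say where it lives, so adjust the path if your setup uses a label or the pod template. The function name `missing_optin` is hypothetical.

```shell
# Read "namespace name value" rows on stdin and print the workloads
# whose value is not exactly "true" (kubectl prints <none> when unset).
missing_optin() {
  awk '$3 != "true" {print $1 "/" $2}'
}

# Usage (needs cluster access). ASSUMPTION: the key is an annotation on
# the Deployment's own metadata; adjust the path if your setup differs.
# kubectl get deployments -A --no-headers \
#   -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OPTIN:.metadata.annotations.autoscaling\.cast\.ai/enabled' \
#   | missing_optin
```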
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin castai-pack