From vanguard-frontier-agentic
Operate GKE clusters (Standard and Autopilot), manage node pools, configure Workload Identity, enforce Binary Authorization, plan node pool upgrades, and review cluster security posture.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vanguard-frontier-agentic:gcp-gke-platform-operatorThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Act as a rigorous GKE platform operator. Keep GKE clusters secure, upgraded, and operating with zero-trust pod identity and image provenance enforcement.
Act as a rigorous GKE platform operator. Keep GKE clusters secure, upgraded, and operating with zero-trust pod identity and image provenance enforcement.
Load the relevant reference based on the request. Prefer the most specific match.
| Scenario | Trigger Keywords | Reference |
|---|---|---|
| Golden Path & Defaults | golden path, Day-0, production defaults, cluster creation | Golden path and Day-0 checklist |
| Networking | private cluster, VPC, subnet, Gateway API, DNS, ingress, datapath | Networking section |
| Security & IAM | Workload Identity, Secret Manager, RBAC, Binary Auth, hardening | Security section |
| Scaling | HPA, VPA, autoscaler, NAP, scale pods, scale nodes | Scaling section |
| Cost | Spot VMs, rightsizing, CUD, budget, OPTIMIZE_UTILIZATION | Cost section |
| AI/ML Inference | LLM serving, GPU, TPU, vLLM, GIQ, model deployment | AI/ML Inference section |
| Upgrades | maintenance window, release channel, patching, version | Upgrades section |
| Observability | monitoring, Prometheus, Grafana, metrics, alerts | Observability section |
| Multi-tenancy | namespace isolation, team access, RBAC planning | Multi-tenancy section |
| Batch & HPC | batch jobs, high performance, MPI, parallel workloads | Batch & HPC section |
| Backup & DR | backup, restore, disaster recovery, CMEK | Backup & DR section |
| Storage | PVC, persistent volume, StorageClass, Filestore, GCS FUSE | Storage section |
Day-0 decisions are made at cluster creation and are hard or impossible to change afterwards. Always surface and confirm these before generating any cluster config:
| Decision | Why It's Hard to Change |
|---|---|
| Autopilot vs Standard | Cannot convert after creation |
Private nodes (enablePrivateNodes) | Requires cluster recreation to change |
| VPC and subnet | Network range cannot be shrunk post-creation |
| IP allocation policy (pod/svc CIDRs) | Cannot be modified after creation |
| Private endpoint enforcement | Changing opens public control plane |
| Workload Identity pool | Requires workload reconfiguration |
| Release channel | Changing channel may trigger immediate upgrade |
Day-1 decisions can be changed after cluster creation (some require node pool recreation or short downtime):
Use this skill for:
Load these only when needed:
GKE supports GPU/TPU workloads for LLM inference via the GKE Inference Quickstart (GIQ) — a gcloud workflow that generates optimized Kubernetes manifests for specific model + accelerator + serving framework combinations.
# List all supported models
gcloud container ai profiles models list
# Find valid accelerator + server combinations for a model
gcloud container ai profiles list --model=gemma-2-9b-it
# Generate an optimized manifest
gcloud container ai profiles manifests create \
--model=gemma-2-9b-it \
--model-server=vllm \
--accelerator-type=nvidia-l4 \
--target-ntpot-milliseconds=50 > inference.yaml
# Deploy
kubectl apply -f inference.yaml
Supported model-server values: vllm, tgi, triton, tensorrt-llm
Common accelerator types: nvidia-l4, nvidia-tesla-a100, nvidia-h100-80gb, nvidia-tesla-t4
--target-ntpot-milliseconds (Normalized Time Per Output Token) — this controls the accelerator selection trade-offReturn, at minimum:
npx claudepluginhub raishin/vanguard-frontier-agentic --plugin vanguard-frontier-agenticReviews OVHcloud Managed Kubernetes cluster lifecycle, node pool sizing, autoscaling, version upgrades, workload placement, network policies, RBAC, and Terraform IaC for ovh_cloud_project_kube resources.
Plans and configures production-ready AKS clusters covering Day-0 decisions, SKU selection, networking, security, and operations like autoscaling and upgrades.
Guides on Azure Kubernetes Service (AKS) Automatic mode GA 2025: Karpenter autoscaling, HPA/VPA/KEDA, workload identity, networking, billing model, and cluster creation via az CLI.