From exec-remote
Deploys a SkyPilot-managed TPU cluster on GKE. Automatically ensures the required node pool exists for the requested TPU type, creating one if necessary. Supports running multiple TPU types in parallel on the same GKE cluster.
Slash command
/exec-remote:deploy-cluster
This skill deploys a SkyPilot-managed TPU cluster on an existing GKE cluster. It builds on the `apply-resource` skill which handles GKE cluster creation via xpk.
Key Feature: Each TPU type gets its own SkyPilot cluster (named <cluster>-<username>-<tpu_type>), allowing multiple topologies to run in parallel on the same GKE cluster. Node pools are automatically managed per TPU type.
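The naming scheme above can be sketched as a small helper. This is an illustration only: the function name `skypilot_cluster_name` is hypothetical, and falling back to `getpass.getuser()` for the username is an assumption, since the source does not say how the deploy script determines it.

```python
import getpass
from typing import Optional


def skypilot_cluster_name(cluster: str, tpu_type: str,
                          username: Optional[str] = None) -> str:
    """Build the per-TPU-type name <cluster>-<username>-<tpu_type>."""
    # Assumption: the local login name stands in for <username>.
    user = username or getpass.getuser()
    return f"{cluster}-{user}-{tpu_type}"


# Two TPU types on the same GKE cluster yield two distinct SkyPilot clusters.
for tpu in ("v6e-1", "v6e-4"):
    print(skypilot_cluster_name("sglang-jax-agent-tests", tpu, username="hongmao"))
```

Because the TPU type is part of the name, deploying a second type never collides with the first.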
- pip install skypilot
- sky --help
- gcloud auth login to authenticate

The following defaults apply unless the user explicitly overrides them:
| Parameter | Default |
|---|---|
| PROJECT_ID | tpu-service-473302 |
| CLUSTER_NAME | sglang-jax-agent-tests |
| ZONE | asia-northeast1-b |
Use these values directly — do NOT ask the user to confirm or re-enter them unless they specify otherwise.
- PROJECT_ID (default: tpu-service-473302)
- TPU_TYPE (e.g., v6e-1, v6e-4, v6e-16): must be specified
- ZONE (default: asia-northeast1-b)
- CLUSTER_NAME (default: sglang-jax-agent-tests)

If all parameters are already known from an upstream caller (e.g., exec-remote), use them directly -- do NOT re-ask. Only prompt interactively when this skill is invoked standalone and the user wants to override defaults.
Each GKE node exposes 4 TPU chips (google.com/tpu: 4), except v6e-1 which exposes 1 chip.
Therefore num_nodes = total_chips / 4 (a single node for v6e-1), and every pod requests 4 chips (1 for v6e-1).
| Type | Topology | Chips/Host | Nodes | Machine Type |
|---|---|---|---|---|
| v6e-1 | 1x1 | 1 | 1 | ct6e-standard-1t |
| v6e-4 | 2x2 | 4 | 1 | ct6e-standard-4t |
| v6e-8 | 2x4 | 4 | 2 | ct6e-standard-4t |
| v6e-16 | 4x4 | 4 | 4 | ct6e-standard-4t |
| v6e-32 | 4x8 | 4 | 8 | ct6e-standard-4t |
| v6e-64 | 8x8 | 4 | 16 | ct6e-standard-4t |
| v6e-128 | 8x16 | 4 | 32 | ct6e-standard-4t |
| v6e-256 | 16x16 | 4 | 64 | ct6e-standard-4t |
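The table follows directly from the chips-per-node rule, which the sketch below reproduces. The function name `tpu_layout` and the returned dictionary keys are hypothetical; the chip counts and machine types are taken from the table above.

```python
# TPU sizes listed in the table above; other sizes are rejected.
SUPPORTED_CHIP_COUNTS = {1, 4, 8, 16, 32, 64, 128, 256}


def tpu_layout(tpu_type: str) -> dict:
    """Derive node count, chips per pod, and machine type from e.g. 'v6e-16'."""
    generation, _, chips_text = tpu_type.partition("-")
    if generation != "v6e" or not chips_text.isdigit():
        raise ValueError(f"unsupported TPU type: {tpu_type}")
    total_chips = int(chips_text)
    if total_chips not in SUPPORTED_CHIP_COUNTS:
        raise ValueError(f"unsupported TPU type: {tpu_type}")
    # v6e-1 is the only type whose node exposes a single chip.
    chips_per_node = 1 if total_chips == 1 else 4
    machine_type = "ct6e-standard-1t" if total_chips == 1 else "ct6e-standard-4t"
    return {
        "num_nodes": total_chips // chips_per_node,
        "chips_per_pod": chips_per_node,
        "machine_type": machine_type,
    }


print(tpu_layout("v6e-16"))
```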
Zone vs Region: xpk always creates GKE clusters at the region level (e.g., asia-northeast1), even when given a zone like asia-northeast1-b. The deploy script handles this automatically -- you may pass either a zone or a region.
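One way to accept either a zone or a region is to strip a trailing single-letter zone suffix. The sketch below is an assumption about how such normalization could work, not the deploy script's actual logic; the helper name `to_region` is hypothetical.

```python
import re


def to_region(location: str) -> str:
    """Return the region for a zone like 'asia-northeast1-b'; pass regions through."""
    # A zone is a region name followed by '-' and one lowercase letter.
    match = re.fullmatch(r"(.+-[a-z]+\d+)-[a-z]", location)
    return match.group(1) if match else location


print(to_region("asia-northeast1-b"))
print(to_region("asia-northeast1"))
```

Both calls print the same region name, so downstream gcloud commands can always target the regional cluster.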
Use the apply-resource skill to create the GKE cluster (or confirm it already exists). This only needs to be done once:
/apply-resource create
Carry forward the resulting CLUSTER_NAME, TPU_TYPE, and ZONE for Step 2.
Before deploying SkyPilot, ensure the GKE cluster status is RUNNING:
gcloud container clusters list --project=$PROJECT_ID \
--filter="name=<CLUSTER_NAME>" --format="table(name,location,status)"
If status is RECONCILING or PROVISIONING, wait until it becomes RUNNING.
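The wait can be automated with a small polling loop. The sketch below is illustrative: the function names, the 30-second poll interval, and the 30-minute timeout are assumptions. The status lookup reuses the `gcloud container clusters list` command shown above, with `--format=value(status)` to print only the status field.

```python
import subprocess
import time
from typing import Callable


def gcloud_cluster_status(project_id: str, cluster_name: str) -> str:
    """Ask gcloud for the cluster status (requires an authenticated gcloud CLI)."""
    result = subprocess.run(
        ["gcloud", "container", "clusters", "list",
         f"--project={project_id}",
         f"--filter=name={cluster_name}",
         "--format=value(status)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_until_running(get_status: Callable[[], str],
                       poll_seconds: float = 30.0,
                       timeout_seconds: float = 1800.0,
                       sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the status is RUNNING; fail on unexpected states or timeout."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "RUNNING":
            return
        if status not in ("RECONCILING", "PROVISIONING"):
            raise RuntimeError(f"unexpected cluster status: {status!r}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"cluster still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage (not executed here because it needs gcloud credentials):
# wait_until_running(lambda: gcloud_cluster_status("tpu-service-473302",
#                                                  "sglang-jax-agent-tests"))
```

Passing the status lookup as a callable keeps the loop testable without touching a real cluster.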
Run the deploy script (located in the scripts/ directory alongside this skill definition):
python scripts/deploy.py <TPU_TYPE> [CLUSTER_NAME] [ZONE]
Only TPU_TYPE is required. CLUSTER_NAME defaults to sglang-jax-agent-tests, ZONE defaults to asia-northeast1-b.
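The positional-argument behaviour described above maps naturally onto `argparse`. This is a sketch of one possible interface, not the actual contents of scripts/deploy.py; the function name `parse_deploy_args` is hypothetical.

```python
import argparse


def parse_deploy_args(argv):
    """Parse <TPU_TYPE> [CLUSTER_NAME] [ZONE] with the documented defaults."""
    parser = argparse.ArgumentParser(prog="deploy.py")
    parser.add_argument("tpu_type", help="TPU type, e.g. v6e-4 (required)")
    parser.add_argument("cluster_name", nargs="?",
                        default="sglang-jax-agent-tests")
    parser.add_argument("zone", nargs="?", default="asia-northeast1-b")
    return parser.parse_args(argv)


args = parse_deploy_args(["v6e-4"])
print(args.tpu_type, args.cluster_name, args.zone)
```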
This script will:
1. Configure cluster credentials with gcloud
2. Ensure the node pool exists (named tpu-<TPU_TYPE>, e.g., tpu-v6e-1)
3. Generate ~/.sky/config.yaml from the template with correct TPU parameters
4. Generate setup.yaml with the correct num_nodes
5. Run sky launch -c <CLUSTER_NAME>-<USERNAME>-<TPU_TYPE> -r <setup.yaml>
6. Write .cluster_name_tpu in the plugin root (for exec-remote integration)

To verify the deployment:
sky status # Check cluster status
sky exec <CLUSTER_NAME>-<USERNAME>-<TPU_TYPE> 'echo hello' # Test remote execution
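The template-based generation of setup.yaml can be illustrated with plain string substitution. The `<NUM_NODES>` placeholder comes from this document; the template text, helper names, and the use of `tempfile` are assumptions made for the sketch.

```python
import tempfile
from pathlib import Path


def render_template(template_text: str, values: dict) -> str:
    """Replace each placeholder (e.g. '<NUM_NODES>') with its value."""
    rendered = template_text
    for placeholder, value in values.items():
        rendered = rendered.replace(placeholder, str(value))
    return rendered


def write_temp_yaml(text: str) -> Path:
    """Write rendered YAML to a temp file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as handle:
        handle.write(text)
    return Path(handle.name)


# Hypothetical one-line template; the real setup.yaml template is not shown here.
SETUP_TEMPLATE = "num_nodes: <NUM_NODES>\n"

setup_path = write_temp_yaml(render_template(SETUP_TEMPLATE, {"<NUM_NODES>": 4}))
print(setup_path.read_text())
```

Writing the rendered task file to a temporary path keeps the template itself untouched between deployments.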
The deploy script intelligently manages GKE node pools:
- Existing pools are matched by machineType and tpuTopology. This detects pools created by xpk, manually, or by previous runs.
- New pools are named tpu-<type> (e.g., tpu-v6e-1, tpu-v6e-4). Single-host TPUs (v6e-1, v6e-4) omit --tpu-topology as GKE infers it from the machine type.
- nodeSelector ensures pods land on the correct pool.
- Pools use --spot and autoscaling (--min-nodes=0).

# First time: create cluster via apply-resource (uses defaults)
/apply-resource create
# Deploy both TPU types (sequentially — config.yaml is global)
python scripts/deploy.py v6e-1
# Creates SkyPilot cluster: sglang-jax-agent-tests-hongmao-v6e-1
python scripts/deploy.py v6e-4
# Creates SkyPilot cluster: sglang-jax-agent-tests-hongmao-v6e-4
# Run tests in parallel on both clusters
sky exec sglang-jax-agent-tests-hongmao-v6e-1 'python test/srt/run_suite.py --suite unit-test-tpu-v6e-1' &
sky exec sglang-jax-agent-tests-hongmao-v6e-4 'python test/srt/run_suite.py --suite e2e-test-tpu-v6e-4' &
wait
Note: deploy.py calls must be sequential because ~/.sky/config.yaml is a global file shared by all SkyPilot operations. However, once both clusters are launched, sky exec commands can run fully in parallel since pods already have the correct node affinity baked in.
The deploy script (scripts/deploy.py) automates:
1. gcloud container clusters get-credentials
2. gcloud beta container node-pools create with correct TPU params
3. config.yaml template -> replaces placeholders -> writes to ~/.sky/config.yaml
4. setup.yaml template -> replaces <NUM_NODES> -> writes to temp file
5. sky launch -c <cluster>-<user>-<tpu_type> -r <setup.yaml>

To tear down SkyPilot clusters:
sky down <CLUSTER_NAME>-<USERNAME>-v6e-1
sky down <CLUSTER_NAME>-<USERNAME>-v6e-4
To also remove the GKE cluster:
/apply-resource delete
npx claudepluginhub primatrix/skills --plugin exec-remote