From vastai
Run a job on a Vast.ai GPU instance end-to-end: find a GPU, launch it, execute the job, monitor it, and destroy the instance when done. Use when the user wants to run a training job, inference task, script, or any workload on a remote GPU.
Slash command: /vastai:run-job
You are an autonomous agent that runs GPU jobs on Vast.ai end-to-end. You will search for a GPU, launch an instance, run the user's job, monitor it to completion, and clean up by destroying the instance.
$ARGUMENTS
Follow these phases in order. If any phase fails, attempt recovery. If unrecoverable, destroy the instance to avoid billing and report the error.
Parse the user's request to determine:
- Docker image (--image): e.g. pytorch/pytorch, nvidia/cuda:12.1.0-devel-ubuntu22.04
- Template (--template_hash or --template_id): A pre-configured environment from vastai search templates
- Onstart command (--onstart-cmd)
- Registry auth (--login): Only if using a private image
Image selection guide (infer from the job if not specified):
| Job type | Suggested image |
|---|---|
| PyTorch training/fine-tuning | pytorch/pytorch |
| General CUDA/C++ work | nvidia/cuda:12.1.0-devel-ubuntu22.04 |
| Hugging Face / transformers | pytorch/pytorch + onstart pip install |
| TensorFlow | tensorflow/tensorflow:latest-gpu |
| vLLM / LLM serving | vllm/vllm-openai:latest |
| JAX | nvidia/cuda:12.1.0-devel-ubuntu22.04 + onstart pip install |
| General Python ML | pytorch/pytorch |
| Custom / user-specified | Whatever they provide |
If the user provides a template hash/ID, skip image selection entirely — the template includes the image.
If critical details are unclear, ask the user with AskUserQuestion. For ambiguous but non-critical details, use sensible defaults.
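The image selection guide above can be sketched as a small lookup. This is only an illustration: the function name suggest_image and the job-type keywords (pytorch, huggingface, python-ml, cuda, jax, tensorflow, vllm) are hypothetical labels invented for this sketch, while the image names come from the table.

```shell
# Map a coarse job type to the suggested image from the table above.
# The job-type keywords are hypothetical labels chosen for this sketch.
suggest_image() {
    case "$1" in
        pytorch|huggingface|python-ml) echo "pytorch/pytorch" ;;
        cuda|jax)                      echo "nvidia/cuda:12.1.0-devel-ubuntu22.04" ;;
        tensorflow)                    echo "tensorflow/tensorflow:latest-gpu" ;;
        vllm)                          echo "vllm/vllm-openai:latest" ;;
        *)                             return 1 ;;  # unknown type: ask the user instead
    esac
}

suggest_image tensorflow
```

A non-zero return is a natural point to fall back to asking the user. For the Hugging Face and JAX rows, the extra pip install still has to go into the onstart command.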
Before launching anything, verify the user has SSH keys configured — without them, you cannot connect to the instance.
vastai show ssh-keys
If keys exist: Note the key ID. Determine which local private key corresponds to it. Check common locations:
ls -la ~/.ssh/id_ed25519 ~/.ssh/id_rsa ~/.ssh/id_ecdsa 2>/dev/null
Store the path to the private key — you'll need it for every ssh and scp command (the -i flag).
If no keys exist: Ask the user what to do:
- Upload an existing public key (e.g. ~/.ssh/id_ed25519.pub):
vastai create ssh-key "$(cat ~/.ssh/id_ed25519.pub)"
- Generate a new key pair and upload its public key:
ssh-keygen -t ed25519 -f ~/.ssh/vastai_ed25519 -N "" -C "vastai"
vastai create ssh-key "$(cat ~/.ssh/vastai_ed25519.pub)"
IMPORTANT: Store the SSH private key path (e.g. SSH_KEY=~/.ssh/id_ed25519) for use in all subsequent SSH/SCP commands. Every ssh and scp command in this workflow MUST include -i $SSH_KEY.
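One possible way to locate the private key and store its path is sketched below. The helper name find_ssh_key is hypothetical. It only checks the common file names listed above and does not verify that the key matches the one registered with Vast.ai.

```shell
# Print the first private key found in the given directory (default: ~/.ssh),
# checking the common names in order. Returns 1 if none exists.
find_ssh_key() {
    local ssh_dir="${1:-$HOME/.ssh}"
    local name
    for name in id_ed25519 id_rsa id_ecdsa; do
        if [ -f "$ssh_dir/$name" ]; then
            echo "$ssh_dir/$name"
            return 0
        fi
    done
    return 1
}

# Guarded so the sketch also runs on a machine without any key.
if SSH_KEY=$(find_ssh_key); then
    echo "Using SSH key: $SSH_KEY"
else
    echo "No SSH key found in ~/.ssh" >&2
fi
```

To confirm that a local key matches a registered one, compare the output of ssh-keygen -y -f "$SSH_KEY" with the public key shown by vastai show ssh-keys.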
Search for suitable offers:
vastai search offers '<QUERY>' -o 'dph_total' --raw
Build the query from requirements. Always include reliability>0.9. Default search:
vastai search offers 'gpu_ram>=24 num_gpus=1 reliability>0.9 disk_space>=64' -o 'dph_total' --raw
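Building the query from requirements could look like the sketch below. The helper name build_offer_query and its positional parameters are assumptions made for illustration. The field names and defaults are taken from the default search above.

```shell
# Build a search query from minimum GPU RAM (GB), GPU count, and disk space (GB).
# Defaults mirror the default search; reliability>0.9 is always included.
build_offer_query() {
    local gpu_ram="${1:-24}"
    local num_gpus="${2:-1}"
    local disk_gb="${3:-64}"
    echo "gpu_ram>=${gpu_ram} num_gpus=${num_gpus} reliability>0.9 disk_space>=${disk_gb}"
}

build_offer_query            # the default query
build_offer_query 80 2 200   # a larger multi-GPU request
```

The result can be passed straight to the search, for example vastai search offers "$(build_offer_query 48)" -o 'dph_total' --raw.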
Parse the JSON output. Select the cheapest suitable offer. Note:
- id: needed to create the instance
- dph_total: cost per hour
- gpu_name: what GPU it is
Tell the user what you found and the cost. Ask for confirmation before proceeding.
Create the instance with SSH access. Choose the right flags based on what the user specified:
Option A — Using a Docker image (most common):
vastai create instance <OFFER_ID> \
--image <DOCKER_IMAGE> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>' \
--onstart-cmd '<SETUP_SCRIPT>' \
[--env '<-e KEY=VAL -p HOST:CONTAINER>'] \
[--login '<REGISTRY_AUTH>']
Option B — Using a template:
vastai create instance <OFFER_ID> \
--template_hash <HASH> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>' \
[--onstart-cmd '<EXTRA_SETUP>']
Option C — Using a template ID:
vastai create instance <OFFER_ID> \
--template_id <ID> \
--disk <DISK_GB> \
--ssh \
--label 'claude-job-<timestamp>'
Onstart-cmd tips:
- Typical uses: pip install, apt-get install, git clone, setting env vars, downloading datasets
- Chain multiple commands with && or ;
- Example: 'pip install transformers datasets && git clone https://github.com/user/repo /root/repo'
Volume attachment (if user needs persistent storage):
--create-volume <VOLUME_OFFER_ID> --volume-size <GB> --mount-path /root/data
# OR
--link-volume <EXISTING_VOLUME_ID> --mount-path /root/data
Capture the instance ID from the response (new_contract field).
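A minimal sketch of capturing the ID follows. It assumes only that the response contains a new_contract field followed by a numeric ID. The exact output format of vastai create instance (JSON or a Python-style dict) is not confirmed here, so the pattern tolerates both. The helper name extract_instance_id is hypothetical.

```shell
# Read the create-instance response on stdin and print the numeric ID that
# follows the new_contract field. Accepts both "new_contract": N and 'new_contract': N.
extract_instance_id() {
    sed -n "s/.*new_contract['\"]*: *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Example with a made-up response string:
echo '{"success": true, "new_contract": 7654321}' | extract_instance_id
```

If jq is available and the command is run with --raw, jq -r '.new_contract' is the more robust choice, in line with the jq usage elsewhere in this workflow.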
Poll until the instance is running:
# Poll every 10 seconds, up to 5 minutes
for i in $(seq 1 30); do
  STATUS=$(vastai show instance <ID> --raw 2>/dev/null | jq -r '.actual_status // .status // "unknown"')
  echo "Attempt $i: status=$STATUS"
  if [ "$STATUS" = "running" ]; then
    echo "Instance is running!"
    break
  fi
  if [ "$STATUS" = "exited" ] || [ "$STATUS" = "offline" ]; then
    echo "Instance failed with status: $STATUS"
    break
  fi
  sleep 10
done
If the instance fails to start within 5 minutes, destroy it and try the next offer.
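The polling loop leaves the caller to work out whether it ended in success, failure, or timeout. One way to make that explicit is a function with distinct return codes, sketched below. The name wait_for_running and the idea of passing the status command as arguments are assumptions of this sketch, chosen so the logic can be exercised without a live instance.

```shell
# wait_for_running ATTEMPTS INTERVAL STATUS_CMD...
# Runs STATUS_CMD on each attempt and inspects the status string it prints.
# Returns 0 when running, 1 on exited/offline, 2 if attempts run out.
wait_for_running() {
    local attempts="$1"
    local interval="$2"
    shift 2
    local i=1
    local status
    while [ "$i" -le "$attempts" ]; do
        status=$("$@" 2>/dev/null) || status="unknown"
        echo "Attempt $i: status=$status" >&2
        case "$status" in
            running)        return 0 ;;
            exited|offline) return 1 ;;
        esac
        sleep "$interval"
        i=$((i + 1))
    done
    return 2
}

# Possible use with the real CLI (30 attempts x 10 s = 5 minutes):
#   instance_status() { vastai show instance "$1" --raw | jq -r '.actual_status // .status // "unknown"'; }
#   wait_for_running 30 10 instance_status "$INSTANCE_ID" || vastai destroy instance "$INSTANCE_ID"
```

Any non-zero return means the instance should be destroyed before trying the next offer.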
Once running, get SSH connection info:
vastai ssh-url <ID>
Parse the SSH URL to extract host and port. The format is ssh://root@<HOST>:<PORT>.
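Extracting host and port from that format needs nothing beyond shell parameter expansion. In the sketch below, the function name parse_ssh_url, the variables SSH_HOST and SSH_PORT, and the example address are all made up for illustration.

```shell
# Split ssh://root@<HOST>:<PORT> into SSH_HOST and SSH_PORT.
parse_ssh_url() {
    local url="$1"
    local hostport="${url#ssh://}"   # drop the scheme
    hostport="${hostport#*@}"        # drop the user part ("root@")
    SSH_HOST="${hostport%:*}"        # everything before the last colon
    SSH_PORT="${hostport##*:}"       # everything after the last colon
}

parse_ssh_url "ssh://root@203.0.113.7:40022"
echo "$SSH_HOST $SSH_PORT"   # prints: 203.0.113.7 40022
```

With a live instance the call would be parse_ssh_url "$(vastai ssh-url "$INSTANCE_ID")".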
Wait an additional 15-20 seconds after "running" status for SSH to become available, then test connectivity:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -o ConnectTimeout=10 -p <PORT> root@<HOST> 'echo connected'
When transferring more than one file, tar+gzip them first to minimize transfer time; never scp many files one by one. A single file can be copied directly with scp.
Single file:
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> <LOCAL_FILE> root@<HOST>:/root/
Multiple files or directories — tar+gzip and stream directly (no temp file):
tar czf - -C <LOCAL_BASE_DIR> <PATHS...> | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar xzf - -C /root/'
Examples:
# Upload a directory
tar czf - -C /home/user my_project/ | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/'
# Upload multiple files from same parent dir
tar czf - -C /home/user/data file1.csv file2.csv model.bin | \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tar xzf - -C /root/data/'
# Upload files from different locations — create a staging tarball first
tar czf /tmp/upload.tar.gz -C /path/a fileA -C /path/b fileB && \
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> /tmp/upload.tar.gz root@<HOST>:/root/ && \
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'cd /root && tar xzf upload.tar.gz && rm upload.tar.gz'
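To see how the repeated -C flags in the last example behave, the pipe can be tried locally with both ends on one machine. The sketch below uses throwaway paths under a temporary directory. It suggests that files from different locations can also be streamed without a staging tarball, although this was only checked against the local tar shown here, not over SSH.

```shell
# Local demonstration: both ends of the pipe run on this machine, so no SSH is needed.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b" "$demo/dest"
echo "alpha" > "$demo/a/fileA"
echo "beta"  > "$demo/b/fileB"

# Each -C switches the directory that the following paths are resolved against,
# so the archive holds fileA and fileB at its top level.
tar czf - -C "$demo/a" fileA -C "$demo/b" fileB | tar xzf - -C "$demo/dest"

ls "$demo/dest"
```

Replacing the right-hand tar with the ssh ... 'tar xzf - -C /root/' command from the examples gives the remote version of the same pipe.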
Execute the job via SSH. For long-running jobs, use nohup or tmux and redirect output:
# For short jobs (< 10 min), run directly:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> '<COMMAND>'
# For long jobs, use nohup so it survives SSH disconnect:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'nohup bash -c "<COMMAND>" > /root/job_output.log 2>&1 & echo $!'
Capture the PID if running in background.
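The launch-and-capture pattern can be tried locally before using it over SSH. The sketch below is an assumption-laden illustration: start_job is a hypothetical helper, it uses sh -c instead of bash -c for portability, and it also redirects stdin from /dev/null, which is not in the command above but may help the SSH session return promptly.

```shell
# start_job LOGFILE COMMAND
# Launch COMMAND detached from the current shell, log its output, and print its PID.
start_job() {
    nohup sh -c "$2" > "$1" 2>&1 < /dev/null &
    echo $!
}

log=$(mktemp)
pid=$(start_job "$log" "echo step 1; sleep 1; echo finished")
echo "Started job with PID $pid, logging to $log"
```

Over SSH, the same capture would be PID=$(ssh ... 'nohup bash -c "<COMMAND>" > /root/job_output.log 2>&1 & echo $!'), with the single quotes ensuring that $! is expanded on the remote side.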
For background jobs, poll for completion:
# Check if process is still running
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'kill -0 <PID> 2>/dev/null && echo running || echo done'
# Check latest output
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> 'tail -20 /root/job_output.log'
Also check instance logs periodically:
vastai logs <ID> --tail 50
Report progress to the user periodically. If the job appears stuck or erroring, alert the user.
Poll interval: every 30 seconds for jobs < 10 min, every 2 minutes for longer jobs.
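The interval rule can be written down as a tiny helper. The name poll_interval_secs and the use of an estimated duration in whole minutes are assumptions of this sketch. A job of exactly 10 minutes is treated as a longer job here, which the text above does not spell out.

```shell
# Print the poll interval in seconds for a job expected to take the given
# number of minutes: 30 s for jobs under 10 minutes, 120 s otherwise.
poll_interval_secs() {
    if [ "$1" -lt 10 ]; then
        echo 30
    else
        echo 120
    fi
}

poll_interval_secs 5    # prints 30
poll_interval_secs 45   # prints 120
```

A monitoring loop would then sleep "$(poll_interval_secs "$EST_MINUTES")" between the kill -0 and tail checks, where EST_MINUTES is a hypothetical variable holding the estimate.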
For directories or multiple outputs, tar+gzip the results on the remote side and stream them back; never scp -r directories. A single file can be copied directly with scp.
Single file:
scp -i <SSH_KEY> -o StrictHostKeyChecking=no -P <PORT> root@<HOST>:<REMOTE_FILE> <LOCAL_PATH>
Directory or multiple outputs — tar+gzip and stream directly:
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root <RESULT_PATHS...>' | \
tar xzf - -C <LOCAL_DEST>
Examples:
# Download an output directory
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/' | tar xzf - -C ./
# Download multiple result paths
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/ checkpoints/ job_output.log' | tar xzf - -C ./results/
# Download and save as a single archive (preserves everything)
ssh -i <SSH_KEY> -o StrictHostKeyChecking=no -p <PORT> root@<HOST> \
'tar czf - -C /root output/' > results.tar.gz
Common result locations to check:
- /root/output/, /root/checkpoints/
- /root/job_output.log
Always destroy the instance when done (or on failure):
vastai destroy instance <ID>
Verify destruction:
vastai show instances --raw | jq '.[] | select(.id == <ID>)'
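Because an unrecoverable failure in any phase should still end with the instance destroyed, one option is to register the cleanup as an exit trap as soon as the instance ID is known. This is a sketch, not part of the original workflow: the names destroy_instance, cleanup, and INSTANCE_ID are hypothetical, and the wrapper function exists only so the destroy call can be swapped out for a dry run.

```shell
# Wrapper around the destroy command so it can be replaced for a dry run.
destroy_instance() {
    vastai destroy instance "$1"
}

# INSTANCE_ID stays empty until an instance has actually been created.
INSTANCE_ID=""

cleanup() {
    if [ -n "$INSTANCE_ID" ]; then
        echo "Destroying instance $INSTANCE_ID" >&2
        destroy_instance "$INSTANCE_ID"
        INSTANCE_ID=""   # avoid a second destroy if cleanup runs again
    fi
}

# Run cleanup whenever the script exits, including after a failed phase.
trap cleanup EXIT
```

Set INSTANCE_ID right after the create step. The verification command above can then confirm that the instance no longer appears in the list.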
Provide the user with a summary:
- ... --force unless the user explicitly asks
- ... vastai show instances to find it by label
npx claudepluginhub liorz/vastai-claude-skill --plugin vastai