From vastai-pack
Polls Vast.ai GPU instances for status changes and triggers handlers on lifecycle events like running, exited, or error. Enables monitoring, auto-recovery, and orchestration workflows.
How this skill is triggered — by the user, by Claude, or both
Slash command
/vastai-pack:vastai-webhooks-eventsThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Build event-driven workflows around Vast.ai GPU instance lifecycle. Vast.ai does not provide traditional webhooks, so event detection relies on polling the REST API at `cloud.vast.ai/api/v0` and reacting to instance status transitions (loading, running, exited, error, offline).
Build event-driven workflows around Vast.ai GPU instance lifecycle. Vast.ai does not provide traditional webhooks, so event detection relies on polling the REST API at cloud.vast.ai/api/v0 and reacting to instance status transitions (loading, running, exited, error, offline).
import time, json, subprocess
from typing import Callable, Dict, List
class InstanceEventPoller:
"""Poll Vast.ai API and emit events on status transitions."""
def __init__(self, api_key: str, poll_interval: int = 30):
self.api_key = api_key
self.poll_interval = poll_interval
self.previous_states: Dict[int, str] = {}
self.handlers: Dict[str, List[Callable]] = {}
def on(self, event: str, handler: Callable):
self.handlers.setdefault(event, []).append(handler)
def poll_once(self):
result = subprocess.run(
["vastai", "show", "instances", "--raw"],
capture_output=True, text=True)
instances = json.loads(result.stdout)
for inst in instances:
inst_id = inst["id"]
status = inst.get("actual_status", "unknown")
prev = self.previous_states.get(inst_id)
if prev and prev != status:
event = f"{prev}_to_{status}"
for handler in self.handlers.get(event, []):
handler(inst)
for handler in self.handlers.get("any_change", []):
handler(inst, prev, status)
self.previous_states[inst_id] = status
def run(self):
print(f"Polling every {self.poll_interval}s...")
while True:
self.poll_once()
time.sleep(self.poll_interval)
def on_instance_running(instance):
print(f"Instance {instance['id']} is RUNNING")
print(f" SSH: ssh -p {instance['ssh_port']} root@{instance['ssh_host']}")
# Trigger: start training job, send notification, etc.
def on_instance_exited(instance):
print(f"Instance {instance['id']} EXITED")
# Trigger: collect results, check for errors, notify team
def on_spot_preemption(instance, old_status, new_status):
if old_status == "running" and new_status in ("exited", "offline"):
print(f"ALERT: Instance {instance['id']} may have been preempted")
# Trigger: auto-recovery, provision replacement
# Wire up handlers
poller = InstanceEventPoller(api_key)
poller.on("loading_to_running", on_instance_running)
poller.on("running_to_exited", on_instance_exited)
poller.on("any_change", on_spot_preemption)
poller.run()
def auto_recover(instance, old_status, new_status):
"""Automatically replace preempted instances."""
if old_status != "running" or new_status not in ("exited", "offline", "error"):
return
gpu_name = instance.get("gpu_name", "RTX_4090")
image = instance.get("image_uuid", "pytorch/pytorch:latest")
print(f"Auto-recovering {instance['id']} ({gpu_name})...")
# Search for replacement
offers = json.loads(subprocess.run(
["vastai", "search", "offers",
f"gpu_name={gpu_name} reliability>0.98 rentable=true",
"--order", "dph_total", "--raw", "--limit", "3"],
capture_output=True, text=True, check=True).stdout)
if offers:
new_id = json.loads(subprocess.run(
["vastai", "create", "instance", str(offers[0]["id"]),
"--image", image, "--disk", "50", "--raw"],
capture_output=True, text=True, check=True).stdout)["new_contract"]
print(f"Replacement instance: {new_id}")
def track_costs(instance, old_status, new_status):
"""Log cost events for billing tracking."""
if new_status == "running":
print(f"BILLING START: Instance {instance['id']} "
f"at ${instance.get('dph_total', 0):.3f}/hr")
elif old_status == "running":
print(f"BILLING STOP: Instance {instance['id']}")
| Error | Cause | Solution |
|---|---|---|
| Missed status transition | Poll interval too long | Reduce to 15-30s for critical instances |
| False preemption alert | Instance restarted intentionally | Track expected state changes |
| Auto-recovery loops | Same host keeps failing | Exclude failed host IDs from search |
| API timeout during poll | Network or rate limiting | Retry with backoff; continue polling |
For performance optimization, see vastai-performance-tuning.
Slack notifications: Wire on_instance_running to send a Slack message with SSH connection details. Wire on_spot_preemption to alert the team.
Training monitor: Track running_to_exited events. If exit was expected (job complete), collect results. If unexpected, trigger auto-recovery with checkpoint resume.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin vastai-packExecutes Vast.ai GPU incident response: triages failures/outages, handles spot preemptions/training crashes, provisions replacements using CLI bash scripts and SSH.
Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.