vastai-rate-limits | vastai-pack

Stats

Actions

Tags

vastai-rate-limits | vastai-pack

Vast.ai Rate Limits

Overview

Handle Vast.ai REST API rate limits gracefully. The API at cloud.vast.ai/api/v0 returns HTTP 429 when request limits are exceeded. Most operations (search, show) are read-heavy and rarely hit limits, but automated scripts doing rapid provisioning or polling can trigger throttling.

Prerequisites

Vast.ai CLI or REST API client
Understanding of exponential backoff

Instructions

Step 1: Rate-Limited HTTP Client

import requests
import time

class RateLimitedVastClient:
    BASE_URL = "https://cloud.vast.ai/api/v0"

    def __init__(self, api_key, min_delay=0.5, max_retries=5):
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {api_key}"
        self.min_delay = min_delay
        self.max_retries = max_retries
        self.last_request = 0

    def request(self, method, endpoint, **kwargs):
        # Enforce minimum delay between requests
        elapsed = time.time() - self.last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)

        for attempt in range(self.max_retries):
            self.last_request = time.time()
            resp = self.session.request(method, f"{self.BASE_URL}{endpoint}", **kwargs)

            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 2 ** attempt))
                print(f"Rate limited. Waiting {retry_after}s (attempt {attempt+1})")
                time.sleep(retry_after)
                continue

            resp.raise_for_status()
            return resp.json()

        raise RuntimeError("Max retries exceeded due to rate limiting")

Step 2: Polling with Adaptive Backoff

def poll_instance_status(client, instance_id, target="running", timeout=300):
    """Poll instance status with increasing intervals."""
    start = time.time()
    interval = 5  # Start at 5s, increase to max 30s

    while time.time() - start < timeout:
        info = client.request("GET", f"/instances/{instance_id}/")
        status = info.get("actual_status", "unknown")

        if status == target:
            return info
        if status in ("error", "offline"):
            raise RuntimeError(f"Instance {instance_id} failed: {status}")

        time.sleep(interval)
        interval = min(interval * 1.5, 30)

    raise TimeoutError(f"Instance did not reach '{target}' within {timeout}s")

Step 3: Batch Search with Throttling

def batch_search(client, gpu_configs):
    """Search for multiple GPU types with rate-limit-safe delays."""
    results = {}
    for config in gpu_configs:
        query = GPUQuery(**config).to_filter()
        offers = client.request("GET", "/bundles/", params={"q": str(query)})
        results[config.get("gpu_name", "any")] = offers.get("offers", [])
        time.sleep(1)  # Be polite between searches
    return results

# Usage
configs = [
    {"gpu_name": "RTX_4090", "max_dph": 0.30},
    {"gpu_name": "A100", "max_dph": 2.00},
    {"gpu_name": "H100_SXM", "max_dph": 4.00},
]
all_offers = batch_search(client, configs)

Step 4: Request Optimization

Strategies to reduce API calls:

Cache search results: Offers change slowly; cache for 60-120 seconds
Use --limit: Restrict search results to what you need
Batch instance checks: Use show instances (lists all) instead of individual show instance ID calls
Avoid polling loops: Use longer intervals (15-30s) for status checks

Output

Rate-limited HTTP client with automatic retry on 429
Adaptive polling for instance status changes
Batch search with inter-request delays
Request optimization strategies

Error Handling

Scenario	Response
First 429	Wait `Retry-After` header value, then retry
Repeated 429s	Double wait time between retries
429 during provisioning	Instance creation is idempotent; safe to retry
429 during search	Cache previous results and use them temporarily

Resources

Next Steps

For security best practices, see vastai-security-basics.

Examples

Safe multi-instance provisioning: Create 10 instances with 2-second delays between each create instance call to avoid triggering rate limits during cluster setup.

Efficient monitoring: Poll all instances with a single show instances call every 30 seconds instead of individual calls per instance.