From deploy-skill
Run a vLLM performance benchmark using synthetic random data to measure throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and other key performance metrics. Use when the user wants to quickly test vLLM serving performance without downloading external datasets.
Slash command
/deploy-skill:vllm-bench-random-synthetic
Run a quick performance benchmark on a vLLM server using synthetic random data. This skill measures core serving metrics including request throughput, token throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and inter-token latency.
Prerequisite: vLLM installed (`pip install vllm`)

The simplest way to run the benchmark:
# Start vLLM server (in background or separate terminal)
vllm serve Qwen/Qwen2.5-1.5B-Instruct
# Run benchmark with random synthetic data
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 10
Note: Use `--backend openai-chat` with endpoint `/v1/chat/completions` for online benchmarks.

| Parameter | Description | Default |
|---|---|---|
| `--backend` | Backend type: vllm, openai, openai-chat | vllm |
| `--model` | Model name (must match the server) | Required |
| `--endpoint` | API endpoint path | /v1/completions or /v1/chat/completions |
| `--dataset-name` | Dataset to use | random (synthetic) |
| `--num-prompts` | Number of requests to send | 10 |
| `--port` | Server port | 8000 |
| `--max-concurrency` | Maximum concurrent requests | Auto |
| `--save-result` | Save results to file | Off |
| `--result-dir` | Directory to save results | ./ |

When successful, you will see output like:
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
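
To pull a single number out of a report like the one above (for a script or a CI check), a small awk filter is usually enough. This is a minimal sketch, not part of the skill: the `extract_metric` function and the `sample_report` variable are hypothetical names, and the filter assumes the report keeps the `Label: value` layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value for one metric label
# from a "vllm bench serve" report read on stdin.
extract_metric() {
  local label="$1"
  # Split each line on ":", match the label exactly,
  # then strip whitespace from the value and stop at the first hit.
  awk -F':' -v want="$label" '
    { key = $1; sub(/[ \t]+$/, "", key) }
    key == want { val = $2; gsub(/[ \t]/, "", val); print val; exit }
  '
}

# Sample lines copied from the report above.
sample_report='Successful requests: 10
Benchmark duration (s): 5.78
Total generated tokens: 2212
Mean TTFT (ms): 71.54
P99 TTFT (ms): 79.49'

printf '%s\n' "$sample_report" | extract_metric "Mean TTFT (ms)"          # prints 71.54
printf '%s\n' "$sample_report" | extract_metric "Benchmark duration (s)"  # prints 5.78
```

In practice the report would come from a saved log, for example `extract_metric "Mean TTFT (ms)" < bench.log`, where `bench.log` is an assumed file name.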

# Run a larger benchmark with 100 prompts
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 100

# Save the results to a directory
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-1.5B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 50 \
--save-result \
--result-dir ./benchmark-results/
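
After a run with `--save-result`, it can be handy to locate the most recent result file. The sketch below assumes results land as `.json` files in the directory passed to `--result-dir`; the `latest_result` helper is a hypothetical name, and nothing is assumed about the fields inside the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the newest .json file in a directory.
latest_result() {
  local dir="${1:-./benchmark-results}"
  # ls -t sorts by modification time, newest first; errors are silenced
  # so an empty or missing directory simply prints nothing.
  ls -t "$dir"/*.json 2>/dev/null | head -n 1
}
```

One way to inspect the file is `python3 -m json.tool "$(latest_result ./benchmark-results)"`, which pretty-prints whatever the benchmark wrote.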

# Benchmark a server on a custom port with limited concurrency
vllm bench serve \
--backend openai-chat \
--model meta-llama/Llama-3.1-8B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random \
--num-prompts 100 \
--port 8001 \
--max-concurrency 4
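
The examples above change one flag at a time. To compare several concurrency levels in one pass, the same command can be wrapped in a loop. This is a sketch under stated assumptions: the `DRY_RUN` switch, the `bench_cmd` helper, and the concurrency levels 1, 4, and 16 are illustrative choices, not part of the skill. Only flags already listed in the parameter table are used.

```shell
#!/usr/bin/env bash
# Hypothetical sweep over --max-concurrency values.
MODEL="Qwen/Qwen2.5-1.5B-Instruct"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; set to 0 to execute them

# Print the benchmark command for one concurrency level.
bench_cmd() {
  echo "vllm bench serve --backend openai-chat --model $MODEL" \
       "--endpoint /v1/chat/completions --dataset-name random" \
       "--num-prompts 100 --max-concurrency $1" \
       "--save-result --result-dir ./benchmark-results/"
}

for c in 1 4 16; do
  cmd="$(bench_cmd "$c")"
  echo "$cmd"
  if [ "$DRY_RUN" = "0" ]; then
    # Word splitting is intentional: the command contains no quoted arguments.
    $cmd
  fi
done
```

Running it as-is only prints the three commands; `DRY_RUN=0` executes them against a server that is already running.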

For quick testing (small models, fast):
- Qwen/Qwen2.5-1.5B-Instruct (recommended for quick tests)
- facebook/opt-125m
- facebook/opt-350m

For realistic benchmarks (medium models):
- Qwen/Qwen2.5-7B-Instruct
- meta-llama/Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3

Workflow:
1. Run `vllm --version` to verify that vLLM is installed.
2. Run `curl http://localhost:8000/health` to check whether a server is already running.
3. Start the server with `vllm serve <model-name>` (wait for "Application startup complete").
4. Run `vllm bench serve` with appropriate parameters.
5. Stop the server with `kill <PID>`.

Server not responding:
- Check the server with `curl http://localhost:8000/health`.
- Use the `--port` flag if the server is on a different port.

Model not found:
- Set `export HF_TOKEN=<your_token>` if needed.

Out of memory:
- Reduce `--num-prompts` or `--max-concurrency`.

Connection refused:

Notes:
- The `random` dataset generates synthetic prompts automatically.
- `--num-prompts`

Install the plugin:
npx claudepluginhub ben-cpy/deploy-skill --plugin deploy-skill
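
The health check from the workflow (`curl http://localhost:8000/health`) can be scripted so the benchmark starts only once the server responds. The following is a minimal sketch, assuming the server runs on localhost; the `wait_for_server` helper, its argument order, and the retry defaults are hypothetical choices rather than anything the skill defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the /health endpoint until the server answers.
# Usage: wait_for_server [port] [max_attempts] [delay_seconds]
wait_for_server() {
  local port="${1:-8000}"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -f: treat HTTP errors as failure, -o /dev/null: discard the body
    if curl -sf -o /dev/null "http://localhost:${port}/health"; then
      echo "server ready on port ${port}"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after ${attempts} attempts" >&2
  return 1
}
```

A typical use would be `wait_for_server 8000 && vllm bench serve ...`, with the benchmark flags taken from the examples earlier in this page.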