From vllm-skills
Benchmarks vLLM automatic prefix caching efficiency using fixed prompts, ShareGPT dataset, or synthetic prefix/suffix patterns. Compares throughput and latency with/without caching for repeated prompts.
Slash command: `/vllm-skills:vllm-prefix-cache-bench`
Benchmark the efficiency of vLLM's automatic prefix caching (APC) feature. The offline script `benchmarks/benchmark_prefix_caching.py` runs directly against the vLLM engine (no server required). For online/serving tests, use `vllm bench serve` with the `prefix_repetition` dataset.
`--enable-prefix-caching`.
Runs a synthetic benchmark with a fixed prompt repeated multiple times to directly measure cache hit efficiency. No dataset download required.
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
To compare against the baseline without caching:
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--no-enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100 \
--input-length-range 128:256
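The two runs above differ only in the caching flag, so a small driver can keep the rest of the arguments in one place. The following is a minimal sketch, not part of the skill: `build_command` and `speedup` are hypothetical helper names, and the durations passed to `speedup` are whatever wall-clock times you read from your own runs.

```python
import shlex


def build_command(enable_caching: bool, model: str = "Qwen/Qwen3-8B") -> list[str]:
    """Build the argv for one benchmark run; only the caching flag differs."""
    flag = "--enable-prefix-caching" if enable_caching else "--no-enable-prefix-caching"
    return [
        "python3", "benchmarks/benchmark_prefix_caching.py",
        "--model", model,
        flag,
        "--num-prompts", "1",
        "--repeat-count", "100",
        "--input-length-range", "128:256",
    ]


def speedup(baseline_seconds: float, cached_seconds: float) -> float:
    """Ratio of the uncached duration to the cached duration (hypothetical helper)."""
    if baseline_seconds <= 0 or cached_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / cached_seconds


# Print both command lines so they can be inspected or pasted into a shell.
for enabled in (True, False):
    print(shlex.join(build_command(enabled)))
```

To execute a run instead of printing it, the list can be passed to `subprocess.run(build_command(True), check=True)` from the vLLM repository root.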
Uses real-world conversational data from ShareGPT to evaluate prefix caching with naturally occurring prompt sharing.
First, download the dataset:
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
Then run the benchmark:
python3 benchmarks/benchmark_prefix_caching.py \
--model Qwen/Qwen3-8B \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--enable-prefix-caching \
--num-prompts 20 \
--repeat-count 5 \
--input-length-range 128:256
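As a rough guide to choosing `--repeat-count`, the sketch below estimates the best-case share of requests that can reuse a cached prompt. It assumes each prompt is fully cached after its first pass and that nothing is evicted; real runs batch requests concurrently, so measured hit rates can be lower. `ideal_hit_fraction` is a hypothetical helper, not part of the benchmark script.

```python
def ideal_hit_fraction(repeat_count: int) -> float:
    """Best-case share of requests that find their prompt already cached.

    Assumes the first pass over each prompt is a miss and every later
    replay is a full hit (no eviction, no concurrent duplicates).
    """
    if repeat_count < 1:
        raise ValueError("repeat_count must be at least 1")
    return (repeat_count - 1) / repeat_count


# Compare a single pass, the ShareGPT example (5), and the fixed-prompt example (100).
for repeat_count in (1, 5, 100):
    print(repeat_count, ideal_hit_fraction(repeat_count))
```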
Uses `vllm bench serve` with the synthetic `prefix_repetition` dataset to test caching via the serving API. This requires a running vLLM server.
First, start the server:
vllm serve Qwen/Qwen3-8B
Then run the benchmark:
vllm bench serve \
--backend openai \
--model Qwen/Qwen3-8B \
--dataset-name prefix_repetition \
--num-prompts 100 \
--prefix-repetition-prefix-len 512 \
--prefix-repetition-suffix-len 128 \
--prefix-repetition-num-prefixes 5 \
--prefix-repetition-output-len 128
Key parameters for prefix_repetition:
| Parameter | Description |
|---|---|
| `--prefix-repetition-prefix-len` | Number of tokens in the shared prefix portion |
| `--prefix-repetition-suffix-len` | Number of tokens in the unique suffix portion |
| `--prefix-repetition-num-prefixes` | Number of distinct prefixes to cycle through |
| `--prefix-repetition-output-len` | Number of output tokens to generate per request |
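To make the four parameters concrete, here is a small sketch of a workload with the same shape: a few shared prefixes, each followed by a unique suffix. It is an illustration only and not vLLM's dataset implementation; the round-robin prefix assignment, the `vocab_size` value, and both function names are assumptions. The second function gives a best-case share of prompt tokens that could be served from cache, ignoring eviction, block granularity, and request concurrency.

```python
import random


def make_prompts(num_prompts, prefix_len, suffix_len, num_prefixes,
                 vocab_size=32000, seed=0):
    """Return token-id lists shaped like shared prefix + unique suffix."""
    rng = random.Random(seed)
    prefixes = [
        [rng.randrange(vocab_size) for _ in range(prefix_len)]
        for _ in range(num_prefixes)
    ]
    prompts = []
    for i in range(num_prompts):
        prefix = prefixes[i % num_prefixes]  # assumed round-robin assignment
        suffix = [rng.randrange(vocab_size) for _ in range(suffix_len)]
        prompts.append(prefix + suffix)
    return prompts


def ideal_cached_token_fraction(num_prompts, prefix_len, suffix_len, num_prefixes):
    """Best-case share of prompt tokens reusable from cache."""
    first_uses = min(num_prompts, num_prefixes)  # first use of each prefix misses
    reused_tokens = (num_prompts - first_uses) * prefix_len
    total_tokens = num_prompts * (prefix_len + suffix_len)
    return reused_tokens / total_tokens


# Same values as the vllm bench serve command above.
prompts = make_prompts(100, 512, 128, 5)
print(len(prompts), len(prompts[0]))
print(ideal_cached_token_fraction(100, 512, 128, 5))
```

With these values, 95 of the 100 prompts can reuse a 512-token prefix out of 640 prompt tokens each, so the best case is 48640 / 64000 of prompt tokens.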
- Run the commands from the vLLM repository root (`cd vllm`).
- Use the model from the examples (`Qwen/Qwen3-8B`) unless the user specifies a different one or the model is unavailable; change only `--model`.
- `--repeat-count` in Options 1 and 2 controls how many times each sampled prompt is replayed; higher values increase the cache hit rate.
- `--input-length-range` accepts a `min:max` token range, e.g. `128:256`.
- For multi-GPU runs, pass `--tensor-parallel-size <N>`.
- To use the xxhash hash algorithm, pass `--prefix-caching-hash-algo xxhash` (requires `pip install xxhash`).

Arguments for `benchmark_prefix_caching.py`:

| Argument | Required | Description |
|---|---|---|
| `--model` | Yes | Model name or path (HuggingFace ID or local path) |
| `--num-prompts` | Yes | Number of prompts to process |
| `--input-length-range` | Yes | Token length range for inputs, e.g. `128:256` |
| `--repeat-count` | No | Number of times each prompt is repeated (default: 1) |
| `--dataset-path` | No | Path to a dataset file (e.g. ShareGPT JSON). Omit for synthetic fixed-prompt mode |
| `--prefix-len` | No | Fixed prefix token length to prepend to every prompt |
| `--output-len` | No | Number of output tokens to generate per request |
| `--sort` | No | Sort prompts by length before benchmarking |
| `--enable-prefix-caching` / `--no-enable-prefix-caching` | No | Toggle APC (recommended: enable to test caching) |
| `--prefix-caching-hash-algo` | No | Hash algorithm: `sha256`, `sha256_cbor`, `xxhash`, `xxhash_cbor` |
| `--tensor-parallel-size` | No | Number of GPUs for tensor parallelism |
| `--disable-detokenize` | No | Skip detokenization to reduce overhead |
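The `min:max` format of `--input-length-range` can be checked before launching a long run. The sketch below is a hypothetical validator written for illustration; it is not the parser used by the benchmark script, and the rule that `min` must not exceed `max` is an assumption.

```python
def parse_length_range(value: str) -> tuple[int, int]:
    """Split a 'min:max' string into two integers and sanity-check them."""
    low_text, separator, high_text = value.partition(":")
    if not separator:
        raise ValueError(f"expected min:max, got {value!r}")
    low, high = int(low_text), int(high_text)
    if low < 0 or high < low:
        raise ValueError(f"invalid range {value!r}: need 0 <= min <= max")
    return low, high


print(parse_length_range("128:256"))
```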
- If `python3 benchmarks/*.py` reports file not found, locate your local vLLM repository first and run the command from that repo root. If you do not have a local copy, clone it with `git clone https://github.com/vllm-project/vllm` and then `cd vllm`.
- For Hugging Face authentication, run `export HF_TOKEN=<your_token>` or pass `--hf-token <your_token>`.
- If `xxhash` or `cbor2` is not installed and you use those hash algorithms, install them first: `pip install xxhash cbor2`.
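A quick way to see whether the optional hashing packages are importable before starting a benchmark is sketched below. The mapping from hash algorithm to required module is an assumption inferred from the algorithm names, and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Assumed mapping, inferred from the algorithm names (not taken from vLLM).
MODULES_FOR_ALGO = {
    "sha256": (),
    "sha256_cbor": ("cbor2",),
    "xxhash": ("xxhash",),
    "xxhash_cbor": ("xxhash", "cbor2"),
}


def find_missing(module_names):
    """Return the names from module_names that cannot be imported."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


for algo, modules in MODULES_FOR_ALGO.items():
    missing = find_missing(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{algo}: {status}")
```

Any module reported as missing can be installed with the `pip install` command given above.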