From flagos-skills
Runs accuracy (FlagEval) and performance benchmarks (vllm bench serve) across 5 workload profiles against a served model, collecting throughput, latency, TTFT, and TPOT metrics.
Slash command: /flagos-skills:perf-test-flagos
Start vLLM serve with the target model, run accuracy benchmarks (when FlagEval is available) and performance benchmarks (vllm bench serve) across multiple profiles.
perf-test/
├── SKILL.md # This file — execution flow
├── scripts/
│ ├── run_benchmark.py # Run single benchmark profile (JSON output)
│ └── run_all_benchmarks.py # Run all 5 profiles, collect + summarize (JSON)
└── references/
└── benchmark-profiles.md # Profile definitions, metrics, vllm bench usage
Reused from env-verify:
env-verify/scripts/test_serve_mode.py — can be used to verify server is healthy
before benchmarking (optional pre-check)
If invoked standalone, ask for container name, model path, TP size, and stack config (full vs base).
If invoked from /flagrelease, these are passed as context.
Use the stack recommended by model-verify. Read references/benchmark-profiles.md
for the vllm serve command pattern.
docker exec -d <CONTAINER> bash -c '
export USE_FLAGGEMS=<0|1>
export FLAGCX_PATH=<path_or_unset>
export VLLM_PLUGINS=<fl_or_unset>
vllm serve <MODEL_PATH> \
--tensor-parallel-size <TP_SIZE> \
--max-num-batched-tokens 4096 \
--max-num-seqs 256 \
--trust-remote-code \
--port 8000 \
<EXTRA_ARGS>
'
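When the launch is driven from a script rather than typed by hand, the command above could be assembled along these lines. This is a minimal sketch, not the skill's own code: the `build_serve_command` helper, the container name, and the model path are hypothetical, and an environment value of `None` stands for "leave unset".

```python
import shlex


def build_serve_command(container, model_path, tp_size, env=None, extra_args=()):
    """Return the argv for `docker exec -d` that launches `vllm serve`.

    `env` maps variable names to values; None means "do not export".
    (Hypothetical helper, mirroring the command pattern shown above.)
    """
    exports = [
        f"export {name}={shlex.quote(str(value))}"
        for name, value in (env or {}).items()
        if value is not None
    ]
    serve = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(tp_size),
        "--max-num-batched-tokens", "4096",
        "--max-num-seqs", "256",
        "--trust-remote-code",
        "--port", "8000",
        *extra_args,
    ]
    # One export per line, then the serve command, all run by a single bash -c.
    script = "\n".join(exports + [shlex.join(serve)])
    return ["docker", "exec", "-d", container, "bash", "-c", script]


# Placeholder container and model path, for illustration only.
cmd = build_serve_command(
    "my-container", "/models/example", 8,
    env={"USE_FLAGGEMS": "1", "FLAGCX_PATH": None, "VLLM_PLUGINS": "fl"},
)
print(cmd[-1])
```

Passing the result to `subprocess.run(cmd, check=True)` would start the server detached. Building an argv list instead of one long shell string keeps quoting of the model path and extra arguments in one place.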
Wait for server ready (poll /health, timeout 300s):
docker exec <CONTAINER> bash -c '
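for i in $(seq 1 150); do
  # Check the HTTP status code rather than the body, which may be empty.
  if [ "$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health 2>/dev/null)" = "200" ]; then
    echo "SERVER_READY"; exit 0
  fi
  sleep 2
done
echo "SERVER_NOT_READY"; exit 1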
'
If server doesn't start, report error and exit.
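The same readiness check can be written in Python, which makes the 300s budget explicit and easy to test. This is a sketch under assumptions: `http_health_ok` and `wait_until_ready` are hypothetical names, and the URL only resolves where port 8000 is reachable (for example inside the container).

```python
import time
import urllib.request


def http_health_ok(url="http://localhost:8000/health"):
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, refused connections, timeouts
        return False


def wait_until_ready(probe=http_health_ok, timeout_s=300, interval_s=2,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False
```

A caller would treat a `False` return as "server failed to start", report the error, and exit before any benchmark runs.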
docker exec <CONTAINER> bash -c '
curl -s http://localhost:8000/v1/models | python3 -c "
import json, sys; print(json.load(sys.stdin)[\"data\"][0][\"id\"])"
'
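The one-liner above raises an unhelpful `IndexError` when the server lists no models. A slightly more defensive parse might look like this; `first_model_id` is a hypothetical helper and the sample payload is made up for illustration, following the OpenAI-style list shape the one-liner already assumes.

```python
import json


def first_model_id(models_json: str) -> str:
    """Extract the first model id from a /v1/models response body."""
    payload = json.loads(models_json)
    models = payload.get("data") or []
    if not models:
        raise ValueError("server reported no models")
    return models[0]["id"]


# Illustrative payload only; a real server returns its own model id.
sample = '{"object": "list", "data": [{"id": "/models/example", "object": "model"}]}'
print(first_model_id(sample))
```

The returned id is what gets passed as `--model <MODEL_NAME>` to the benchmark script.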
STATUS: FlagEval test client not yet available.
When FlagEval becomes available, update this section with:
Current behavior: Report accuracy test as SKIPPED.
Copy scripts into the container and run:
docker cp <SKILL_DIR>/scripts/run_benchmark.py <CONTAINER>:/tmp/
docker cp <SKILL_DIR>/scripts/run_all_benchmarks.py <CONTAINER>:/tmp/
docker exec <CONTAINER> python3 /tmp/run_all_benchmarks.py \
--model <MODEL_NAME> \
--tokenizer <MODEL_PATH> \
--port 8000 \
--output-dir /data/results/perf
The script runs all 5 default profiles (see references/benchmark-profiles.md),
saves per-profile JSON to /data/results/perf/, and outputs a combined JSON report
with a summary table.
Important: One profile failure does NOT skip remaining profiles.
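The continue-on-failure rule can be sketched as a loop that records each outcome instead of propagating the exception. This is not the actual `run_all_benchmarks.py`; `run_profiles`, the `run_one` callback, and the profile names are hypothetical.

```python
def run_profiles(profiles, run_one):
    """Run every profile in order; a failure is recorded, never re-raised."""
    results = []
    for name in profiles:
        try:
            metrics = run_one(name)
            results.append({"profile": name, "status": "PASS", "metrics": metrics})
        except Exception as exc:  # deliberately broad: keep going no matter what
            results.append({"profile": name, "status": "FAIL", "error": str(exc)})
    return results


def demo_runner(name):
    """Stand-in for a real benchmark call; fails on one made-up profile."""
    if name == "profile-b":
        raise RuntimeError("benchmark exited with code 1")
    return {}


results = run_profiles(["profile-a", "profile-b", "profile-c"], demo_runner)
print([r["status"] for r in results])
```

Catching per iteration means a crash in the second profile still leaves results for the third, which is what lets the final report come out as PARTIAL rather than FAIL.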
docker exec <CONTAINER> bash -c 'pkill -f "vllm serve" || true'
{
"status": "PASS | PARTIAL | FAIL",
"stage": "perf-test",
"model": "<MODEL_PATH>",
"tensor_parallel_size": 8,
"flags": {"USE_FLAGGEMS": "1|0", "FLAGCX_PATH": "..."},
"accuracy": {
"status": "SKIPPED",
"reason": "FlagEval test client not yet available"
},
"performance": {
"status": "PASS | PARTIAL | FAIL",
"profiles_passed": "5/5",
"profiles": [ "...per-profile results..." ],
"summary_table": "...markdown table..."
}
}
Present the summary table to the user:
| Profile | Input | Output | Prompts | Req/s | Tok/s | TTFT(ms) | TPOT(ms) | P99(ms) | Status |
|---------|-------|--------|---------|-------|-------|----------|----------|---------|--------|
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
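A table in this layout could be rendered from per-profile results roughly as follows. The sketch assumes each result is a dict keyed by the column headers, which may not match the real JSON field names, and the values in the demo row are placeholders rather than measurements.

```python
COLUMNS = ["Profile", "Input", "Output", "Prompts", "Req/s", "Tok/s",
           "TTFT(ms)", "TPOT(ms)", "P99(ms)", "Status"]


def render_summary(rows):
    """Render rows (dicts keyed by column header) as a markdown table."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "|" + "|".join("---" for _ in COLUMNS) + "|"
    lines = [header, divider]
    for row in rows:
        # Missing metrics (for example from a failed profile) show as n/a.
        cells = [str(row.get(col, "n/a")) for col in COLUMNS]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder rows; the second one mimics a failed profile with no metrics.
table = render_summary([
    {"Profile": "example", "Input": 0, "Output": 0, "Prompts": 0,
     "Req/s": 0.0, "Tok/s": 0.0, "TTFT(ms)": 0.0, "TPOT(ms)": 0.0,
     "P99(ms)": 0.0, "Status": "PASS"},
    {"Profile": "example-failed", "Status": "FAIL"},
])
print(table)
```

Filling absent metrics with `n/a` keeps failed profiles visible in the table instead of silently dropping their rows.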
Status logic:
- PASS: all profiles completed
- PARTIAL: some passed, some failed
- FAIL: server didn't start or all profiles failed

| Failure | Behavior |
|---|---|
| Server fails to start | Report error; exit |
| `vllm bench serve` not found | Report vllm version issue |
| Single profile fails | Report error, continue remaining profiles |
| Single profile times out | Kill after 600s, report partial, continue |
| Server crashes mid-benchmark | Capture logs, report which profile caused crash |
| OOM during high concurrency | Report, suggest reducing num_prompts |
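The PASS / PARTIAL / FAIL rule stated earlier, together with the `profiles_passed` field of the report, could be computed as in this sketch. The function names are hypothetical, and treating an empty result list as FAIL is an assumption the source does not spell out.

```python
def overall_status(server_started, profile_statuses):
    """Map per-profile statuses to the report-level status."""
    if not server_started:
        return "FAIL"
    passed = sum(1 for s in profile_statuses if s == "PASS")
    if passed == 0:
        return "FAIL"  # assumption: no results at all also counts as FAIL
    if passed == len(profile_statuses):
        return "PASS"
    return "PARTIAL"


def profiles_passed(profile_statuses):
    """Format the passed/total counter, e.g. '5/5'."""
    passed = sum(1 for s in profile_statuses if s == "PASS")
    return f"{passed}/{len(profile_statuses)}"


statuses = ["PASS", "PASS", "FAIL", "PASS", "PASS"]
print(overall_status(True, statuses), profiles_passed(statuses))
```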
| Operation | Timeout |
|---|---|
| Server startup | 300s |
| Per profile benchmark | 600s |
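One way to enforce the per-profile limit is `subprocess.run` with a `timeout`, which kills the child process and raises `TimeoutExpired` once the budget is spent. The sketch below is an assumption about how this might be wired up, not the skill's actual script; `run_with_timeout` is a hypothetical helper.

```python
import subprocess
import sys

PROFILE_TIMEOUT_S = 600  # per-profile limit from the table above


def run_with_timeout(argv, timeout_s=PROFILE_TIMEOUT_S):
    """Run one benchmark command; on timeout return whatever output was captured."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout
        if isinstance(partial, bytes):  # partial output may arrive undecoded
            partial = partial.decode(errors="replace")
        return {"status": "TIMEOUT", "timeout_s": timeout_s, "stdout": partial or ""}
    status = "PASS" if proc.returncode == 0 else "FAIL"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}


# Demo with a short limit and a child that sleeps past it.
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
print(slow["status"])
```

Returning a result dict instead of raising lets the caller record the timeout as a partial result and move on to the next profile.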
npx claudepluginhub flagos-ai/skills --plugin flagos-skills