From systems-design
Provides LLM serving optimization recommendations for latency, inference costs, and throughput. Scans configs, detects stacks like vLLM/TGI, suggests quantization, batching, KV cache, and framework changes.
Slash command
/systems-design:optimize-llm
Get quick, actionable recommendations for LLM serving optimization.
/sd:optimize-llm [focus]
focus (optional): Optimization priority
- latency - Focus on reducing response time
- cost - Focus on reducing inference costs
- throughput - Focus on maximizing requests/second

/sd:optimize-llm
/sd:optimize-llm latency
/sd:optimize-llm cost
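The optional `focus` argument can be validated before any analysis runs. The following is a minimal sketch, assuming the argument arrives as a plain string; `parse_focus` and `FOCUS_OPTIONS` are hypothetical names used for illustration, not part of the skill itself.

```python
from typing import Optional

# Focus values and descriptions copied from the usage section above.
FOCUS_OPTIONS = {
    "latency": "Focus on reducing response time",
    "cost": "Focus on reducing inference costs",
    "throughput": "Focus on maximizing requests/second",
}


def parse_focus(argument: Optional[str]) -> Optional[str]:
    """Return a normalized focus value, or None when no focus was given.

    Hypothetical helper: the skill's real argument handling is not shown
    in the source, so this only illustrates one reasonable approach.
    """
    if argument is None or not argument.strip():
        return None  # focus is optional, so a missing value is valid
    focus = argument.strip().lower()
    if focus not in FOCUS_OPTIONS:
        valid = ", ".join(sorted(FOCUS_OPTIONS))
        raise ValueError(f"Unknown focus {argument!r}; expected one of: {valid}")
    return focus
```

Calling `parse_focus("latency")` returns `"latency"`, while `parse_focus(None)` returns `None`, which corresponds to running the command without a focus.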
Gather Context
Spawn LLM Optimization Advisor Agent
Use the llm-optimization-advisor agent to analyze and provide recommendations. The agent specializes in:
Present Recommendations
Display optimization opportunities organized by:
## LLM Optimization Report
### Current Setup
- Model: [detected or ask]
- Framework: [detected or unknown]
- Hardware: [detected or ask]
### Quick Wins
1. [Optimization] - [Expected impact]
2. ...
### Medium Effort Optimizations
1. [Optimization] - [Expected impact]
2. ...
### Advanced Optimizations
1. [Optimization] - [Expected impact]
2. ...
### Estimated Total Impact
- Latency: [X]% improvement
- Cost: [X]% reduction
- Throughput: [X]x increase
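The report template above can be filled in programmatically. The sketch below is one possible way to group recommendations by effort tier and render the first four sections; the `Recommendation` class, the tier keys, and `render_report` are hypothetical names, and the sample entries are placeholders rather than measured results. The "Estimated Total Impact" section is left out because it needs real measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recommendation:
    optimization: str
    expected_impact: str
    tier: str  # hypothetical tier keys: "quick", "medium", or "advanced"


# Tier keys mapped to the headings used in the report template, in order.
TIER_HEADINGS = [
    ("quick", "Quick Wins"),
    ("medium", "Medium Effort Optimizations"),
    ("advanced", "Advanced Optimizations"),
]


def render_report(model: str, framework: str, hardware: str,
                  recommendations: List[Recommendation]) -> str:
    """Render the setup and tiered recommendation sections as markdown."""
    lines = [
        "## LLM Optimization Report",
        "### Current Setup",
        f"- Model: {model}",
        f"- Framework: {framework}",
        f"- Hardware: {hardware}",
    ]
    for tier, heading in TIER_HEADINGS:
        lines.append(f"### {heading}")
        tier_items = [r for r in recommendations if r.tier == tier]
        # Numbering restarts at 1 within each tier, as in the template.
        for index, rec in enumerate(tier_items, start=1):
            lines.append(f"{index}. {rec.optimization} - {rec.expected_impact}")
    return "\n".join(lines)


# Placeholder inputs for illustration only.
sample = [
    Recommendation("Enable continuous batching", "higher throughput under load", "quick"),
    Recommendation("Quantize model weights", "lower memory use and cost", "medium"),
]
report = render_report("unknown", "vLLM", "unknown", sample)
print(report)
```

Keeping the tier order in a single list means the headings always appear in the template's order, regardless of the order in which recommendations are collected.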
npx claudepluginhub melodic-software/claude-code-plugins --plugin systems-design

Provides patterns for LLM inference infrastructure with serving frameworks like vLLM, TGI, TensorRT-LLM; quantization, batching strategies, KV cache, and streaming responses. Use for optimizing latency and scaling deployments.
Compares SGLang, vLLM, and TensorRT-LLM for the same model and workload to find the best deployment command under a given GPU budget and latency SLA.
Orchestrates online benchmarks for vLLM inference services using `vllm bench serve`. Supports single/multi-case batch execution with result aggregation and auto-optimization for throughput under latency SLOs (TTFT, TPOT, P99).