By BBuf
Autonomously optimize LLM serving infrastructure — profile torch traces, benchmark SGLang/vLLM/TensorRT-LLM, simulate capacity and compute, and run RLCR loops that patch code to match or beat competitor performance. Also includes human-like PR review and incident triage for production serving.
Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.
Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.
Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.
Return public original model architecture diagrams for user-specified LLM, VLM, MoE, diffusion, OCR, and SGLang/sgl-cookbook model families. Use when the user asks for a model structure chart, architecture diagram, or rendered image link for a specific model such as DeepSeek, GLM, Qwen, Kimi, MiniMax, Step, Hunyuan, or Qwen3-VL.
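The capacity questions mentioned above (KV cache budget, max-concurrency estimates) come down to simple arithmetic once the model shape and the memory left for the cache are known. The following is a minimal sketch, assuming a standard multi-head or grouped-query attention layout with one K and one V vector per layer. The function names, the model shape, and the 40 GiB budget are illustrative assumptions; they are not values parsed from any SGLang or vLLM log, and this is not the skill's own implementation.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: a K and a V vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_requests(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """How many requests of a given total length fit in the KV cache budget."""
    total_tokens = kv_budget_bytes // bytes_per_token
    return total_tokens // tokens_per_request


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB left for KV cache after weights and activations.
budget = 40 * 1024**3
capacity_tokens = budget // per_token
max_req = max_concurrent_requests(budget, tokens_per_request=4096,
                                  bytes_per_token=per_token)

print(f"{per_token} bytes/token, {capacity_tokens} tokens, {max_req} requests")
```

With these assumed numbers, each token costs 128 KiB of cache, so a 40 GiB budget holds 327,680 tokens, or 80 requests of 4,096 tokens each. Real servers also reserve memory for CUDA graphs, activations, and fragmentation, so treat the result as an upper bound.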
Agent-ready playbooks for LLM serving benchmarks, capacity planning, torch-profiler triage, pipeline analysis, compute simulation, SGLang/vLLM optimization, human code review, production incidents, and model PR intelligence.
This repository is built for AI infrastructure engineers who want agents to do real work, not recite generic prompts.
It gives an agent the operational memory needed to benchmark SGLang, vLLM, and TensorRT-LLM fairly; explain serving capacity from startup logs; split prefill and decode profiler evidence; inspect traces at layer and kernel level; estimate operator FLOPs and MFU; review SGLang patches against real maintainer discussion patterns; run Humanize-governed SGLang and vLLM SOTA loops; triage SGLang production incidents from a replay; and keep model-family optimization history close to the code that actually changed.
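As a rough illustration of kernel-level trace inspection, the sketch below sums GPU kernel time by kernel name. It assumes the Chrome trace JSON layout that the PyTorch profiler exports, where kernels appear as complete events (`ph == "X"`) with `cat == "kernel"` and `dur` in microseconds. The helper names and the in-memory sample are hypothetical, and the repository's skills may work quite differently.

```python
import gzip
import json
from collections import defaultdict


def load_trace(path):
    """Load a trace.json or trace.json.gz file into a dict."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def kernel_time_by_name(trace):
    """Return (kernel name, total duration in us) pairs, largest first."""
    totals = defaultdict(float)
    for ev in trace.get("traceEvents", []):
        # Assumption: GPU kernels are complete events in the "kernel" category.
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += ev.get("dur", 0.0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Tiny hand-written sample standing in for a real profiler export.
sample = {"traceEvents": [
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 0, "dur": 120.0},
    {"ph": "X", "cat": "kernel", "name": "attention_kernel", "ts": 130, "dur": 80.0},
    {"ph": "X", "cat": "kernel", "name": "gemm_kernel", "ts": 220, "dur": 100.0},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "ts": 0, "dur": 10.0},
]}

top = kernel_time_by_name(sample)
for name, total_us in top:
    print(f"{name}: {total_us:.1f} us")
```

For a real file, `kernel_time_by_name(load_trace("trace.json.gz"))` would give the same kind of ranking. Separating prefill from decode would additionally need a way to window events by timestamp, which this sketch does not attempt.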
For standalone kernel campaigns and kernel evidence tools, see the sibling project KDA-Pilot.
If this saves you one stale model-support assumption, one misleading profiler trace, or one late-night benchmark loop, a star helps more AI-infra engineers find it.
| Skill | Use it when |
|---|---|
| `llm-serving-auto-benchmark` | You need a fair, bounded serving benchmark search for SGLang, vLLM, TensorRT-LLM, or another OpenAI-compatible stack. |
| `llm-serving-capacity-planner` | You need to explain SGLang or vLLM startup memory, KV cache budget, request capacity, or OOM pressure from logs. |
| `llm-torch-profiler-analysis` | You need a three-table profiler report that keeps extend/prefill and decode evidence separate. |
| `llm-pipeline-analysis` | You need forward-pass, layer, and kernel-level timing from a torch profiler trace, including anchor boundaries and Perfetto ranges. |
| `model-compute-simulation` | You need operator shapes, FLOPs, MFU estimates, kernel-to-op mapping, or parallelism what-if analysis for an LLM serving shape. |
| `sglang-humanize-review` | You need SGLang code-review findings grounded in full human PR review episodes from project start through the latest refresh (June 2026), including inline code context, top-level discussion, review summaries, and multi-round replies. Every review opens with a PR comprehension pass (a change summary plus a Mermaid execution flowchart with the diff's modified steps marked), so the reviewer sees how the PR runs before the findings. |
| `sglang-sota-humanize-loop` | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, SGLang patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| `vllm-sota-humanize-loop` | You want one model-level Humanize RLCR loop that owns gap decisions, profiler triage, required layer-pipeline deep dives, vLLM patches, optional ncu-report-skill evidence, and real-model revalidation after the fixed fair benchmark. |
| `sglang-prod-incident-triage` | You need to turn queue growth, timeouts, wrong outputs, crashes, or distributed stalls into a replay and next debug step. |
| `model-architecture-diagram` | You need original public architecture diagrams for popular LLM, VLM, MoE, OCR, and diffusion model families. |
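
The FLOPs and MFU estimates listed in the table can be illustrated with a single matrix multiply. This is a minimal sketch, assuming the usual 2·m·n·k FLOP count for a dense GEMM. The shape, the 10 µs kernel time, and the 1 PFLOP/s peak are made-up numbers for illustration, not measurements or vendor specifications, and the functions are hypothetical rather than part of any skill.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def mfu(flops, elapsed_s, peak_flops_per_s):
    """Model FLOPs utilization: achieved FLOP/s divided by peak FLOP/s."""
    return flops / (elapsed_s * peak_flops_per_s)


# Hypothetical projection: 64 tokens through a 4096 -> 4096 linear layer.
flops = matmul_flops(m=64, n=4096, k=4096)

# Hypothetical kernel time (10 us) and hypothetical device peak (1e15 FLOP/s).
utilization = mfu(flops, elapsed_s=10e-6, peak_flops_per_s=1e15)

print(f"{flops} FLOPs, MFU = {utilization:.1%}")
```

Under these assumptions the kernel performs about 2.15 GFLOPs and reaches roughly 21% MFU. Small decode batches like this are typically memory-bound, which is one reason a low MFU on its own does not prove a kernel is badly written.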
npx claudepluginhub bbuf/ai-infra-auto-driven-skills --plugin ai-infra-auto-driven-skills
Humanize - An iterative development plugin that uses Codex to review Claude's work. Creates a feedback loop where Claude implements plans and Codex independently reviews progress, ensuring quality through continuous refinement.