From ccds-ai
Inference performance and cost specialist. Auto-invoked when inference latency, throughput, token cost, batching, quantization, KV caching, or GPU scheduling is being tuned.
Slash command
/ccds-ai:ai-inference-perf
Serving-layer changes compound across every request — this is where the cost and latency wins live. Every perf change is also a potential quality regression in disguise, so measurement and quality gates travel together.
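One way to make "measurement and quality gates travel together" concrete is a single acceptance check that looks at both numbers at once. The sketch below is illustrative only: `RunStats`, `accept_change`, and the threshold defaults are hypothetical names and values, not part of this skill, and the quality score is assumed to come from whatever task eval is already in place.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RunStats:
    """Measurements from one benchmark run (hypothetical structure)."""
    latencies_ms: list[float]  # per-request end-to-end latency
    quality_score: float       # task eval score, higher is better


def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(values, n=20)[-1]


def accept_change(
    baseline: RunStats,
    candidate: RunStats,
    max_quality_drop: float = 0.01,  # assumed tolerance, tune per task
    min_latency_gain: float = 0.05,  # assumed: require at least 5% better p95
) -> bool:
    """Accept a perf change only if quality holds AND p95 latency improves."""
    quality_ok = candidate.quality_score >= baseline.quality_score - max_quality_drop
    latency_ok = p95(candidate.latencies_ms) <= p95(baseline.latencies_ms) * (1 - min_latency_gain)
    return quality_ok and latency_ok
```

The point of the design is that a latency win alone never passes: a change that is faster but fails the quality tolerance is rejected, and so is one that keeps quality but does not move p95.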
| Lever | Typical win | Quality risk | Reach for it when |
|---|---|---|---|
| Prompt/prefix caching | large on shared-prefix workloads | none | always first |
| Trim prompt / cap output length | linear in tokens cut | low — still eval it | bloated system prompts, stale few-shots |
| Streaming + TTFT focus | perceived latency | none | any interactive surface |
| Continuous batching | 2–10× throughput | none (TPOT rises slightly) | self-hosted, concurrency > 1 |
| Quantization (INT8, AWQ/GPTQ 4-bit) | 2–4× less weight memory vs FP16, faster decode | real — task eval required | GPU memory-bound |
| Speculative decoding | 1.5–3× decode speedup | none (output distribution preserved) | latency-bound, draft model available |
| Smaller / distilled model | step change in cost | high — full eval | quality bar has headroom |
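Two of the rows above rest on quantities that are easy to compute by hand. The sketch below shows one common way to derive TTFT (time to first token) and TPOT (time per output token) from streamed token timestamps, plus a back-of-envelope weight-memory estimate for the quantization row. The function names are hypothetical, and the memory estimate deliberately ignores quantization metadata (scales, zero points), KV cache, and activations, so treat it as a lower bound.

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in the same time unit as the inputs.

    request_start: when the request was sent.
    token_times:   arrival time of each streamed output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT: average gap between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
    for bits in (16, 8, 4):
        print(f"7B params at {bits}-bit: {weight_memory_gb(7_000_000_000, bits):.1f} GB")
```

For a 7B-parameter model this gives roughly 14 GB at 16-bit, 7 GB at INT8, and 3.5 GB at 4-bit, which is where the 2–4× range in the table comes from.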
Related: ai-eval (quality gates on perf changes), ai-rag (retrieval inside the
latency budget), ai-finetune (serving adapters) · domain agent: ai-architect
(model selection, serving topology, cost/latency budgets) · output/ADR format:
playbook-conventions
npx claudepluginhub ggrace519/claude-code-dev-studio --plugin ccds-ai