From ccds-ai
LLM evaluation specialist. Auto-invoked when eval harnesses are being built, regression suites are designed, human-rater protocols are set up, or online evals (canaries, A/B) are being instrumented.
Slash command
/ccds-ai:ai-eval
Without eval, every prompt or model change is a guess and every regression is a surprise. The harness is what turns "it seems better" into a number you can gate on.
| Question | Eval | Notes |
|---|---|---|
| Can it be scored deterministically? | exact match / schema check / unit-test style | always prefer this; skip the judge |
| Did this change regress quality? | offline golden set in CI | same set, same scorer, before/after |
| Is output A better than B? | pairwise LLM-judge, order-swapped | calibrate against humans first |
| Is the rubric itself right? | human raters, measure kappa | ≥ 0.6 before automating it |
| Does it hold up on real traffic? | shadow / canary, then A/B | offline pass is the entry ticket, not the proof |
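The first two rows of the table (a deterministic scorer, then the same golden set and scorer run before and after a change) can be sketched roughly as below. This is a minimal illustration, not the plugin's harness: `GOLDEN_SET`, `exact_match`, `pass_rate`, `gate`, and the two stub systems are all hypothetical names, and a real suite would call the model instead of a dictionary lookup.

```python
from typing import Callable

# Hypothetical golden set: (input, expected output) pairs.
GOLDEN_SET = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
]

def exact_match(output: str, expected: str) -> bool:
    """Deterministic scorer: equality after trimming whitespace and lowercasing."""
    return output.strip().lower() == expected.strip().lower()

def pass_rate(system: Callable[[str], str]) -> float:
    """Fraction of golden-set items the system answers correctly."""
    hits = sum(exact_match(system(question), want) for question, want in GOLDEN_SET)
    return hits / len(GOLDEN_SET)

def gate(baseline: Callable[[str], str],
         candidate: Callable[[str], str],
         max_drop: float = 0.0) -> bool:
    """True if the candidate's pass rate is within max_drop of the baseline's."""
    return pass_rate(candidate) >= pass_rate(baseline) - max_drop

# Stub systems standing in for "before" and "after" (assumptions for illustration).
_before = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
_after = {"2+2": "4", "capital of France": "paris ", "3*3": "6"}

def baseline(question: str) -> str:
    return _before[question]

def candidate(question: str) -> str:
    return _after[question]

if __name__ == "__main__":
    print("baseline pass rate:", pass_rate(baseline))
    print("candidate pass rate:", pass_rate(candidate))
    print("gate passes:", gate(baseline, candidate))
```

In CI, the boolean from `gate` would typically decide the exit code; `max_drop` is an assumed tolerance knob, and leaving it at zero means any regression on the golden set blocks the change.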
A worked LLM-as-judge harness (rubric prompt, order swap, human-calibration loop,
CI gate) is in references/llm-judge.md.
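Independently of that reference file, the order swap and the human-calibration check can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Judge` signature, the "first"/"second" return convention, the stub judges, and the choice of Cohen's kappa (the table only says "kappa"). A real judge would wrap an LLM call with the rubric prompt.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (prompt, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]

def pairwise_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Query the judge twice with the order swapped.

    Returns "A" or "B" when both orderings agree, otherwise "tie",
    since a verdict that flips with position signals position bias.
    """
    a_wins_forward = judge(prompt, a, b) == "first"    # A shown first
    a_wins_backward = judge(prompt, b, a) == "second"  # B shown first
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"

def cohens_kappa(x: Sequence[str], y: Sequence[str]) -> float:
    """Cohen's kappa between two equal-length label sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(x)
    observed = sum(p == q for p, q in zip(x, y)) / n
    counts_x, counts_y = Counter(x), Counter(y)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_x[label] * counts_y[label] for label in counts_x) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Stub judges (assumptions): one always picks the first slot, one picks the longer text.
def position_biased_judge(prompt: str, first: str, second: str) -> str:
    return "first"

def longer_is_better_judge(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

if __name__ == "__main__":
    print(pairwise_verdict(position_biased_judge, "q", "long answer", "short"))
    print(pairwise_verdict(longer_is_better_judge, "q", "long answer", "short"))
    human = ["A", "A", "B", "B"]
    judged = ["A", "A", "B", "A"]
    print(cohens_kappa(human, judged))
```

Comparing the judge's verdicts with human labels on the same pairs through `cohens_kappa` gives one number to hold against the 0.6 threshold in the table before trusting the judge in automation.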
Related: ai-prompt-engineer (the prompts under test), ai-finetune (eval-coupled
training), ai-inference-perf (quality gates on perf changes) · domain agent:
ai-architect (eval strategy) · output/ADR format: playbook-conventions
npx claudepluginhub ggrace519/claude-code-dev-studio --plugin ccds-ai