Audit the evaluation layer of an AI product against design-completeness questions: offline criteria, ground-truth quality, online signal, cohort/disparate-impact, adversarial + robustness coverage, and drift detection. Eval-side companion to `ai-ux-review`. Produces an editable Markdown artifact plus a self-contained HTML report under `docs/ai-ux/`. Use when the user asks to "review my AI eval setup", "audit my eval design", "is my AI eval rigorous enough", "responsible-AI eval review", "fairness eval check", or "drift detection design", or ships an LLM/ML feature and wants an eval-rigor check. Invoke even if only one block is named — the others stress-test it. Does NOT trigger for: human-AI UX review (`ai-ux-review`); lean-canvas work (`validation-canvas`); adversarial pre-mortem with a verdict (`startup-grill`); SKILL.md audits (`skill-evaluator`); implementing eval pipelines, writing eval code, or labeling datasets (this skill names gaps, it does not build them).
Slash command: `/agent-skills:ai-eval-review`
- `README.md`
- `references/blocks/01-necessity-success.md`
- `references/blocks/02-ground-truth-labels.md`
- `references/blocks/03-offline-eval-design.md`
- `references/blocks/04-online-metrics-signal.md`
- `references/blocks/05-cohort-disparate-impact.md`
- `references/blocks/06-adversarial-robustness.md`
- `references/blocks/07-drift-monitoring.md`
- `templates/ai-eval-review.html`
- `templates/ai-eval-review.md`

Audit the eval layer of an AI product or feature against a structured set of design-completeness questions. Each of the seven blocks is a gap detector: the value of this skill is whether the elicitation surfaces eval decisions the builder hasn't made yet, not whether the artifact looks complete.
The job is eval-design-completeness: have we designed how we'll know if this works? Covers offline criteria, ground-truth quality, online signal, cohort breakdowns and disparate impact, adversarial / robustness coverage, and drift detection. Regulatory rigor (EU AI Act, FDA SaMD, FTC) is a cross-cutting lens applied across blocks, not a separate block.
This skill is the eval-side companion to ai-ux-review. Same shape, same
elicitation pattern, different subject — ai-ux-review audits the
human-AI design surface (was the experience intentionally designed?);
this skill audits the measurement layer behind it (do we have signal for
whether the design works?).
This is not an implementation tool. It does not write eval code, label datasets, set up monitoring dashboards, or compute metrics. It names the gaps; you take them to your tools or your team to close.
Always produced under the resolved review root (default `docs/ai-ux/`, the same folder as `ai-ux-review`, for clean composition):

- `ai-eval-review.md`: canonical, editable Markdown with one top-level section per block (seven blocks total), plus a `## Gap Summary` section listing the eval decisions still unmade. Headings are load-bearing if downstream tools (e.g., a future eval-plan or eval-task generator) need to grep them.
- `ai-eval-review.html`: single self-contained HTML rendering the seven blocks visually, with `[GAP]` chips on blocks with unresolved decisions and a Gap Summary footer. Opens in any browser, prints cleanly to PDF, zero network dependencies.

Both files carry the same content: the HTML is the visual primary; the Markdown is the source of truth.
- Human-AI UX review: use `ai-ux-review`. UX design and eval design are different surfaces. A product can have great evals on a poorly designed UX (users distrust accurate outputs); it can also have great UX on poorly evaluated AI (users trust outputs that are silently wrong). Both matter; use the two skills together.
- Compliance pass: use `team-composer` with `@legal_compliance_advisor`.
- Pitching the product: use `pitch-deck`.

| Skill | Owns |
|---|---|
| `ai-ux-review` Block 6 (Output Integrity) | Design intent: how the product surfaces verifiability, provenance, prompt-injection mitigation, autonomy levels. Was this designed? |
| `ai-eval-review` Block 6 (Adversarial & Robustness) | Measurement: how robustness is tested, covering red-team coverage, injection-eval results, OOD detection accuracy, jailbreak rates. Is this measured? |
If both skills are run, treat them as a pair: ai-ux-review asks "did we design for X?", ai-eval-review asks "do we measure X?". The boundary is design vs. measurement.
When ai-ux-review.md already exists in the working folder, this skill reads it as context — Block 7 (Success & Evaluation) gaps seed Block 1 here. Don't re-elicit what's already been named.
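The seeding step described above can be sketched as a small parser. This is an illustration only, not part of the skill: it assumes gap markers take the `[Gap — …]` form quoted in this document and that `ai-ux-review.md` uses level-2 headings beginning with `## Block 7`; the function name and the sample text are hypothetical.

```python
import re

# Assumption: markers look like "[Gap — <text>]", the form quoted in this skill.
GAP_MARKER = re.compile(r"\[Gap\s+—\s+([^\]]+)\]")


def extract_block_gaps(markdown: str, heading_prefix: str = "## Block 7") -> list:
    """Return gap texts found under any level-2 heading that starts with the prefix."""
    gaps = []
    in_block = False
    for line in markdown.splitlines():
        if line.startswith("## "):
            # Every level-2 heading either opens the target block or closes it.
            in_block = line.startswith(heading_prefix)
            continue
        if in_block:
            gaps.extend(text.strip() for text in GAP_MARKER.findall(line))
    return gaps


# Hypothetical excerpt of an ai-ux-review.md file.
sample = """\
## Block 6 — Output Integrity
- [Gap — no provenance display decided]
## Block 7 — Success & Evaluation
- Success metric: [Gap — no offline criterion defined]
- [Gap — no plan for silent-failure detection]
## Gap Summary
1. [Gap — unrelated summary line]
"""

print(extract_block_gaps(sample))
```

Only the two markers under Block 7 are returned; markers in other blocks and in the Gap Summary are ignored, which matches the rule that Block 7 gaps are what seed Block 1.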
Influences. This skill is informed by several open eval frameworks and regulatory texts:
- Holistic Evaluation of Language Models (HELM) — Stanford CRFM, Apache 2.0. Multi-dimensional eval (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency).
- Anthropic's claude-cookbooks — MIT. Patterns for designing and running LLM evals.
- OpenAI Evals — MIT. Framework + registry of benchmarks.
- `GoogleChrome/modern-web-guidance-src`: Apache-2.0. Closed-loop calibration methodology: paired gold-standard + negative fixtures to prove a test discriminates the rule it claims to test, plus opportunity (100% − unguided-pass) and uplift (guided − unguided) vocabulary for measuring whether a feature actually helps. Applied here to Block 3 (offline eval set construction must include hard negatives) and Block 4 (online metrics should be framed against an unguided baseline when feasible).
- EU AI Act: regulatory text, reusable with attribution. High-risk system requirements drive Block 5 cohort + Block 6 robustness probes.
- FTC AI guidance, FDA SaMD eval expectations — US federal works, public domain.
The skill uses general eval-design concepts (the kinds of decisions to make, not specific benchmark scores) and authors its own elicitation flow, probes, and acceptance criteria. No verbatim content lifted from any of the above sources. See `README.md` for the full influences note and the copyright-vs-derivative-work reasoning.

For the gold+negative fixture pattern in concrete form, cross-reference `skill-evaluator/references/calibration-loop.md` rather than duplicating the methodology here.
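As a quick illustration of the opportunity and uplift vocabulary and of the gold+negative fixture idea, here is a minimal sketch. It is not taken from any of the sources listed above: the `cites_source` rule, the fixtures, and the pass rates are hypothetical values chosen only to show the arithmetic.

```python
def opportunity(unguided_pass_rate: float) -> float:
    """Headroom left for the feature: 100% minus the unguided pass rate."""
    return 1.0 - unguided_pass_rate


def uplift(guided_pass_rate: float, unguided_pass_rate: float) -> float:
    """How much the feature helps: guided minus unguided pass rate."""
    return guided_pass_rate - unguided_pass_rate


def discriminates(check, gold_fixtures, negative_fixtures) -> bool:
    """A check discriminates its rule if every gold fixture passes and every negative fails."""
    passes_gold = all(check(f) for f in gold_fixtures)
    fails_negatives = not any(check(f) for f in negative_fixtures)
    return passes_gold and fails_negatives


# Hypothetical rule under test: "an answer must cite a source".
def cites_source(answer: str) -> bool:
    return "[source:" in answer


gold = ["Paris is the capital [source: atlas]."]
negatives = ["Paris is the capital.", "Trust me, it is Paris."]

print(discriminates(cites_source, gold, negatives))    # True
print(discriminates(lambda a: True, gold, negatives))  # False: passes everything
print(f"opportunity: {opportunity(0.60):.0%}")         # opportunity: 40%
print(f"uplift: {uplift(0.85, 0.60):.0%}")             # uplift: 25%
```

The second `discriminates` call shows why hard negatives matter for Block 3: a check that passes every input looks fine against gold fixtures alone.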
Claude must internalize all seven before interviewing. Blocks follow the reasoning order from "what is success?" → "what do we measure against?" → "how do we measure?" → "what do we measure in production?" → "who does it work for?" → "how does it fail?" → "how do we know it's still working?".
| # | Block | Core question |
|---|---|---|
| 1 | Eval necessity & success definition | What does "good enough to ship" mean? What's the offline criterion that, if cleared, means the AI is doing its job? |
| 2 | Ground truth & label quality | Where do labels come from? Who labels them? What's the inter-annotator agreement? Where does ground truth not exist? |
| 3 | Offline eval design | What's the eval set composition? Distribution coverage? Leakage protection? Statistical power? |
| 4 | Online metrics & signal | What's measured in production? Direct vs. proxy. Failure signals that distinguish silent degradation from engagement. |
| 5 | Cohort breakdown & disparate impact | Per-segment performance. Fairness. Disparate impact. Which segments are under-served and how would you know? |
| 6 | Adversarial & robustness | Red-team coverage, prompt-injection eval, OOD detection, jailbreak resistance, distribution shift. |
| 7 | Drift detection & monitoring | How do you know the system is still working tomorrow? Model drift, behavior drift, data drift, alerting cadence. |
Cross-block lens — Regulatory rigor. EU AI Act (high-risk systems), FDA SaMD (risk-class-proportional eval), FTC AI guidance (disparate impact, deception). Not its own block — applied as a Phase 2 check across the seven, surfacing where regulatory rigor changes the bar for Blocks 1, 5, and 6 specifically.
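To make Block 5's "how would you know?" concrete, one possible per-cohort comparison is sketched below. The skill does not prescribe a fairness metric; the disparate impact ratio and the 0.8 flag threshold (a rule of thumb borrowed from US employment-selection guidance) are assumptions used here for illustration, and the cohort names and counts are made up.

```python
def disparate_impact_ratio(outcomes: dict) -> float:
    """outcomes maps cohort name -> (favorable_count, total_count).

    Returns the lowest cohort rate divided by the highest cohort rate.
    """
    rates = {cohort: fav / total for cohort, (fav, total) in outcomes.items()}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no cohort has any favorable outcomes")
    return min(rates.values()) / highest


# Hypothetical counts: acceptable outputs per cohort out of 100 sampled requests.
observed = {"cohort_a": (80, 100), "cohort_b": (60, 100)}

ratio = disparate_impact_ratio(observed)
# Assumed threshold: flag the block for review when the ratio falls below 0.8.
print(f"{ratio:.2f}", "flag" if ratio < 0.8 else "ok")  # 0.75 flag
```

A flag here is a prompt for a Block 5 decision (which segments, which metric, what bar), not a verdict on fairness.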
See references/blocks/01-necessity-success.md through
references/blocks/07-drift-monitoring.md for each block's deep probe
questions, acceptance criteria, and common gap patterns. The skill body
uses these references lazily — read them when the phase calls for them,
not all up front.
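Block 2 asks for inter-annotator agreement without naming a statistic. As one common option, Cohen's kappa for two annotators can be computed as in the sketch below; the choice of kappa, the function name, and the labels are assumptions for illustration rather than something the block references require.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on ten outputs.
annotator_1 = ["good", "good", "bad", "bad", "good", "bad", "good", "good", "bad", "good"]
annotator_2 = ["good", "bad", "bad", "bad", "good", "bad", "good", "good", "good", "good"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.583
```

Raw agreement in this example is 80%, yet kappa is about 0.58 once chance agreement is removed. A single-annotator label set has no agreement figure at all, which makes it a natural Block 2 gap.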
Goal: establish context — what's being reviewed, what AI type, what lifecycle stage, what regulatory context, what adjacent artifacts exist.
Look for these in order; first match wins:

1. `output_dir` arg → use as-is.
2. `STARTUP_KIT_DOCS_ROOT` env var → `${STARTUP_KIT_DOCS_ROOT}/ai-ux/`.
3. `docs/startup-kit/` exists → `docs/startup-kit/ai-ux/`.
4. `docs/ai-ux/` exists → `docs/ai-ux/` (sibling of an existing `ai-ux-review` run).
5. Otherwise → `docs/ai-ux/`.

Both `ai-ux-review` and `ai-eval-review` share the `ai-ux/` folder by convention: the human-AI design layer and its measurement layer belong side-by-side.
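The resolution order above can be sketched as a small function. This is an illustrative reading of the rules, not code shipped with the skill: the function name and the `cwd` parameter are hypothetical, and because the last two rules both resolve to `docs/ai-ux/`, they collapse into one fallback.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_review_root(output_dir: Optional[str] = None, cwd: Optional[Path] = None) -> Path:
    """Return the review root; the first matching rule wins."""
    base = cwd or Path.cwd()
    if output_dir:
        return Path(output_dir)            # explicit argument: use as-is
    env_root = os.environ.get("STARTUP_KIT_DOCS_ROOT")
    if env_root:
        return Path(env_root) / "ai-ux"    # env var: append ai-ux/
    if (base / "docs" / "startup-kit").is_dir():
        return base / "docs" / "startup-kit" / "ai-ux"
    # An existing docs/ai-ux/ and the default both resolve to the same folder.
    return base / "docs" / "ai-ux"


print(resolve_review_root(output_dir="out/reviews"))
```

An update-mode check would then be a matter of testing whether `ai-eval-review.md` already exists under the returned path.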
If ai-eval-review.md already exists at the resolved root, this is an
update-mode run — see "Update mode" below.
Before asking questions, check for files this skill can read as context:

- `<resolved-root>/ai-ux-review.md`: if present, read it. Block 7 (Success & Evaluation) is the load-bearing input: any `[Gap — …]` markers there seed this skill's Block 1. Mirror back the gaps and ask: "ai-ux-review flagged these eval gaps: [list]. Should I seed Block 1 from them, or are you starting fresh?"
- `<resolved-canvas-root>/validation-canvas.md`: if present, this is an AI startup. Block 5 (Cohort breakdown) reads Customer Segments to anchor per-segment eval questions.
- `<resolved-brand-root>/DESIGN.md`: for HTML token styling, same pattern as `ai-ux-review`.

Ask only the intake questions not already obvious from context:
ai-ux-review.md
Block 1 if it exists.)

Before proceeding to block work, mirror back the framing:

> "Reviewing the eval layer for an LLM-powered email-draft feature, in development, no formal regulatory regime. Reading `ai-ux-review.md` Block 7 gaps: there are three. I'll seed Block 1 from them. Stop me if I should reframe."
Then proceed to Phase 1.
ai-ux-review Block 7 gaps when the file is present.
Read them. Mirror them back as Block 1 seeds. Confirm.

Goal: Walk the seven blocks in order. For each, the output is either a concrete eval decision or an explicit gap marker (`[Gap — what hasn't been decided yet]`).
| Role | Lens |
|---|---|
| `@data_scientist` | Lead interviewer. Asks block-by-block. Pushes for specificity on labels, metrics, statistical rigor. |
| `@ai_system_architect` | Eval orchestration, infrastructure realism, where eval lives in the system. |
| `@ai_safety_specialist` | Blocks 5 + 6 (cohorts + adversarial). Flags disparate-impact gaps and red-team coverage. |
| `@senior_product_manager` | Blocks 1 + 4 (necessity + online signal). Flags vague success criteria and proxy metrics treated as direct. |
| `@legal_compliance_advisor` | Cross-block regulatory lens. Surfaces EU AI Act / FDA SaMD / FTC hooks where they apply. |
Do not jump around. Each block sets up constraints later blocks inherit.
For each block:

- Read `references/blocks/<NN>-<name>.md`.
- Where no decision exists, mark `[Gap — what hasn't been decided: <what>]`.

Each block has at least one specific, testable eval decision OR an explicit gap marker. "We'll figure out metrics later" is a gap, not an answer.

Mark `[Gap — <what hasn't been decided>: <why it matters>]`. A review with honest gaps is more useful than one with invented confidence.
Full seven-block walk takes 30–60 minutes depending on AI type and lifecycle stage. Single-block work is ~10 minutes — but still invoke the skill, since adjacent blocks stress-test the one being reviewed.
Goal: Surface contradictions, dependencies, and gaps that only show up when blocks are read together. Six mandatory checks plus the regulatory cross-cutting lens.
- If `ai-ux-review.md` exists and Block 6 names prompt-injection mitigations, does this skill's Block 6 actually measure injection resistance? Designed mitigations without eval signal are theater.

Regulatory cross-cutting lens (mandatory when regulatory context is non-trivial): EU AI Act high-risk requires data quality, human oversight, accuracy + robustness, and post-market monitoring; these map to Blocks 2, 4, 6, 7 respectively. FDA SaMD ties eval rigor to risk class. FTC AI guidance emphasizes disparate impact (Block 5) and truthfulness (Block 6). Surface where regulatory rigor changes the bar, not just where regulators have an opinion.
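The requirement-to-block mapping stated above can be held as a small lookup so the Phase 2 check lists which blocks to revisit. This is a sketch: the regime keys and the function name are hypothetical labels, the block numbers are exactly those given in the text, and FDA SaMD is left out because the text ties it to risk class rather than to specific blocks.

```python
# Block numbers as stated in the text; dictionary keys are hypothetical labels.
REGULATORY_BLOCKS = {
    "eu_ai_act_high_risk": {
        "data quality": 2,
        "human oversight": 4,
        "accuracy and robustness": 6,
        "post-market monitoring": 7,
    },
    "ftc_ai_guidance": {
        "disparate impact": 5,
        "truthfulness": 6,
    },
}


def blocks_to_recheck(regimes: list) -> dict:
    """Group the applicable requirements by the block whose bar they raise."""
    result = {}
    for regime in regimes:
        # Unknown regimes contribute nothing rather than raising an error.
        for requirement, block in REGULATORY_BLOCKS.get(regime, {}).items():
            result.setdefault(block, []).append(f"{regime}: {requirement}")
    return dict(sorted(result.items()))


for block, reasons in blocks_to_recheck(["eu_ai_act_high_risk", "ftc_ai_guidance"]).items():
    print(block, reasons)
```

Block 6 appears with two reasons in this example, which is the kind of overlap the cross-cutting lens is meant to surface.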
Append a `## Gap Summary` section with the 3–5 most urgent unmade eval decisions. For each: the gap in one line, why it matters, and the cheapest experiment to resolve it.
Same structure as `ai-ux-review`. Produce two files at the resolved root, present via `present_files` or `computer://` links, close with three lines.

`ai-eval-review.md`

Structure (headings must match exactly):
# AI Eval Review — [Product / Feature Name]
> Generated on [YYYY-MM-DD]. AI type: [LLM | classical-ML | CV | multi-modal | agentic]. Lifecycle: [pre-MVP | prototype | development | shipped]. Regulatory: [unregulated | consumer-trust-sensitive | EU AI Act limited-risk | EU AI Act high-risk | FDA SaMD class N | FTC adjacent].
## Block 1 — Eval Necessity & Success Definition
- ...
## Block 2 — Ground Truth & Label Quality
- ...
## Block 3 — Offline Eval Design
- ...
## Block 4 — Online Metrics & Signal
- ...
## Block 5 — Cohort Breakdown & Disparate Impact
- ...
## Block 6 — Adversarial & Robustness
- ...
## Block 7 — Drift Detection & Monitoring
- ...
---
## Gap Summary
1. **[Gap in one line]**
- Why it matters: ...
- Cheapest experiment to resolve: ...
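Because the headings must match exactly, a quick check along the following lines could catch drift before downstream tools grep the file. It is a sketch under assumptions: the function name is hypothetical, and the heading strings are copied from the structure above.

```python
# Level-2 headings copied from the required structure of ai-eval-review.md.
REQUIRED_HEADINGS = [
    "## Block 1 — Eval Necessity & Success Definition",
    "## Block 2 — Ground Truth & Label Quality",
    "## Block 3 — Offline Eval Design",
    "## Block 4 — Online Metrics & Signal",
    "## Block 5 — Cohort Breakdown & Disparate Impact",
    "## Block 6 — Adversarial & Robustness",
    "## Block 7 — Drift Detection & Monitoring",
    "## Gap Summary",
]


def missing_headings(markdown: str) -> list:
    """Return required headings that do not appear as exact lines in the document."""
    present = {line.rstrip() for line in markdown.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in present]


# Hypothetical draft that forgot Block 7.
draft = "\n".join(h for h in REQUIRED_HEADINGS if "Block 7" not in h)

print(missing_headings(draft))  # ['## Block 7 — Drift Detection & Monitoring']
```

The comparison is deliberately exact, so a heading written with a hyphen instead of the dash shown above would also be reported as missing.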
`ai-eval-review.html`

Read the template at `templates/ai-eval-review.html` and produce a single self-contained HTML that:

- Renders the seven blocks visually (matching `ai-ux-review` for visual parity when both reviews exist for the same product).
- Marks each block that has unresolved decisions with a `[GAP]` chip.
- Takes its styling from `<resolved-brand-root>/DESIGN.md` if present.

The two files are written to:

- `<review-root>/ai-eval-review.md`
- `<review-root>/ai-eval-review.html`

Side-by-side with `ai-ux-review.md` + `ai-ux-review.html` when present.
End with three lines:

- "team-composer with `@legal_compliance_advisor` for compliance pass," "if `ai-ux-review` hasn't been run for this product, run it next to close the design side."

When `ai-eval-review.md` already exists:

- `<!-- updated YYYY-MM-DD: <reason> -->`.

Files produced:

- `<review-root>/ai-eval-review.md`: canonical, editable source of truth
- `<review-root>/ai-eval-review.html`: self-contained visual review (primary deliverable)

No other files.
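Phase 0 (Intake)

- Review root resolved (shared with `ai-ux-review` when present)
- Adjacent artifacts checked (`ai-ux-review.md`, `validation-canvas.md`, `DESIGN.md`, attached eval plan)
- If `ai-ux-review.md` Block 7 has gaps, they're surfaced and Block 1 is seeded from them

Content (per block)

- Each block has a concrete eval decision or an explicit `[Gap — ...]` marker

Cross-block (Phase 2)

Rendering

- Styling taken from `DESIGN.md` if present

Shipping

- Both files written to the review root (side-by-side with `ai-ux-review.md` when present)
- Files presented via `present_files` or `computer://` links

| Skill | When to Use |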
|---|---|
| `ai-ux-review` (our own) | Sibling skill. Different subject, same shape. Run both for a complete review of an AI product. If `ai-ux-review.md` exists, this skill reads Block 7 as input and seeds Block 1 from `[Gap — …]` markers. |
| `validation-canvas` (our own) | Upstream when the product is idea-stage. Block 5 (Cohort breakdown) reads Customer Segments. |
| `brand-workshop` (our own) | Upstream when a brand exists. HTML output styled from `DESIGN.md`. |
| `team-composer` (our own) | Alternative when the builder wants a discussion on one narrow eval question (e.g., "let's debate offline vs. online for our use case") rather than a full review artifact. Use `team-composer` with `@data_scientist` + `@ai_safety_specialist`. |
| `riskiest-assumption-test` (our own) | Composition. If the Gap Summary surfaces eval assumptions that need testing, not just deciding (e.g., "we assume single-annotator labels are accurate enough"), hand them to RAT. |
| `startup-grill` (our own) | Adjacent. Adversarial pre-mortem with verdict. Different mode. |
| `pitch-deck` (our own) | Downstream. Strong eval decisions seed slide content for traction / validation claims. |
| HELM (Stanford CRFM, Apache 2.0) | Implementation reference. Multi-dimensional eval (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) — point at it when Block 3 or Block 5 needs an actual benchmark. |
| Anthropic claude-cookbooks (MIT) | Implementation reference. Patterns for designing LLM evals; useful when Block 3 needs concrete examples. |
| OpenAI Evals (MIT) | Implementation reference. Framework + benchmark registry; useful when Block 3 needs to scope an eval-set build. |
| `theme-factory` (Anthropic) | When the HTML review needs branded styling and no `DESIGN.md` exists. |
| `pdf` (Anthropic) | When merging the review into a regulatory submission or board packet. |
| `docx` (Anthropic) | When the review needs to ship as `.docx`. |
| `ai-safety-mindset` (Anthropic) | When the team lacks shared vocabulary for responsible-AI eval. Blocks 5 + 6 specifically benefit. |
Principle: this skill owns the eval-design-completeness artifact —
whether the measurement layer of an AI product has been intentionally
designed. It does not implement the eval (that's the team's eval
platform), validate beliefs (that's validation-canvas), test hypotheses
(riskiest-assumption-test), pitch the product (pitch-deck),
adversarially probe it (startup-grill), or audit the human-AI design
surface (that's ai-ux-review — the sibling skill).
Graceful degradation: if a referenced skill is not installed, this
skill still ships ai-eval-review.md + .html. Cross-skill chains are
enhancements, not requirements.
- `references/blocks/01-necessity-success.md`: what success means; offline criterion; AI-type-specific success patterns
- `references/blocks/02-ground-truth-labels.md`: label sources; quality; coverage gaps; inter-annotator agreement
- `references/blocks/03-offline-eval-design.md`: eval set composition; distribution coverage; leakage protection; statistical power
- `references/blocks/04-online-metrics-signal.md`: production metrics; proxy vs. direct; failure signal vs. engagement
- `references/blocks/05-cohort-disparate-impact.md`: per-segment performance; fairness; harm distribution; regulatory hooks
- `references/blocks/06-adversarial-robustness.md`: red-team coverage; prompt injection; OOD detection; jailbreak resistance; classical-ML adversarial
- `references/blocks/07-drift-monitoring.md`: model / behavior / data drift; alerting; retraining cadence; monitoring infrastructure

Read these when the phase calls for them. Do not front-load all references at once; that's the progressive disclosure pattern this repo uses (see `CLAUDE.md` → "Harness vocabulary").
Tags: ai, eval, evaluation, mlops, responsible-ai, fairness, drift, llm, design-review
npx claudepluginhub sorawit-w/agent-skills --plugin agent-skills