From agent-skills
Audit an AI product against structured human-AI design questions — classical UX (user need, mental models, trust calibration, feedback + control, graceful failure) plus the gen-AI integrity surface older frameworks miss: hallucination handling, output verifiability, provenance + citation, prompt-injection exposure, and agent autonomy. Produces an editable Markdown artifact plus a self-contained HTML report under `docs/ai-ux/`. Use when the user asks to "review my AI feature", "audit my AI product", "is my AI experience trustworthy", "responsible AI UX review", or ships an LLM feature and wants a pre-launch trust review. If `validation-canvas` output exists, assume the business model is settled and focus on UX execution. Does NOT trigger for: lean-canvas work (`validation-canvas`); adversarial pre-mortem with a verdict (`startup-grill`); SKILL.md audits (`skill-evaluator`); brand/logo work (`brand-workshop`); building eval pipelines or writing eval code (names gaps only).
Slash command: `/agent-skills:ai-ux-review`
Files:

- README.md
- references/blocks/01-why-ai.md
- references/blocks/02-mental-model.md
- references/blocks/03-trust-calibration.md
- references/blocks/04-feedback-control.md
- references/blocks/05-errors-graceful-failure.md
- references/blocks/06-output-integrity.md
- references/blocks/07-success-evaluation.md
- templates/ai-ux-review.html
- templates/ai-ux-review.md

Audit an AI product or feature against a structured set of human-AI design questions. Each of the seven blocks is a gap detector, not a template field: the value of this skill is whether the elicitation surfaces design decisions the builder hadn't made yet, not whether the artifact looks complete.
The job is design-completeness: have we designed the human side of this responsibly? It covers both the classical human-AI UX surface (user need, mental models, trust, feedback, errors) and the gen-AI integrity surface that showed up after 2022 (hallucination handling, verifiability, provenance, prompt-injection exposure, agent autonomy).
This is not a quality assessment of an AI product's output. It is a structured review of whether the design around that output is intentional.
Always produced under the resolved review root (default docs/ai-ux/):
- `ai-ux-review.md`: canonical, editable Markdown with one top-level section per block (seven blocks total), plus a `## Gap Summary` section listing the design decisions still unmade. Headings are load-bearing for future composability with a potential ai-eval-rubric companion skill.
- `ai-ux-review.html`: single self-contained HTML file rendering the seven blocks visually, with gap-marker glyphs for blocks where the builder said "we haven't designed for this yet." Opens in any browser, prints cleanly to PDF, zero network dependencies.

Both files carry the same content: the HTML is the visual primary; the Markdown is the source of truth the builder edits as the product matures.
Does not trigger for:

- Lean-canvas work: use validation-canvas. This skill assumes the business model is settled and audits the UX execution layer.
- Adversarial pre-mortem with a verdict: use startup-grill.
- SKILL.md audits: use skill-evaluator. This skill audits AI product designs, not agent-skill text.
- Pitching the product: use pitch-deck.
- Compliance work: use team-composer with @legal_compliance_advisor.

This skill intentionally overlaps with team-composer (@ux_researcher and @ai_safety_specialist are active there too) but differs in deliverable:

- Use ai-ux-review when: the builder wants a persistent artifact they can return to, edit, and share. The seven-block structure and the gap summary are the load-bearing features.
- Use team-composer with @ux_researcher + @ai_safety_specialist when: the builder wants a discussion on one narrow question (one specific trust affordance, one prompt-injection mitigation) without committing to a full review. Discussion-grade, not artifact-grade.

When validation-canvas output already exists in the working folder, this skill reads it as context: Problem, Customer Segments, and UVP are inputs, not re-elicited. Block 1 ("Why AI here?") becomes a pressure test rather than a discovery.
Influences. This skill is inspired by Google's People + AI Guidebook (CC BY-NC-SA 4.0). The Guidebook's six-chapter scaffolding is the conceptual ancestor of blocks 1–5. This skill is not a derivative work under copyright law: it uses general AI UX concepts (ideas / facts, not protected expression) and authors its own elicitation flow, probes, and acceptance criteria. No Guidebook prose, worksheets, illustrations, or pattern names are reproduced. See README.md for full attribution.
Claude must internalize all seven before interviewing. Blocks 1–5 cover the classical human-AI UX surface; block 6 is the gen-AI integrity layer; block 7 closes the loop with success measurement.
| # | Block | Core question |
|---|---|---|
| 1 | Why AI here? | What human task is the AI doing, and what makes AI the right tool here vs. a deterministic alternative? |
| 2 | Mental model | What model will users build of how this works, and where will their model diverge from reality? |
| 3 | Trust calibration | When and how does the system surface confidence, sources, or limits? What are the under-trust and over-trust failure paths? |
| 4 | Feedback & control | What can users do when the AI is wrong? How do they correct, override, hand off, or take over? |
| 5 | Errors & graceful failure | What does failure look like in the UI, and what's the recovery path per severity tier? |
| 6 | Output integrity | Hallucination handling, provenance + citation, output verifiability, prompt-injection surface, multi-turn drift, agent-autonomy levels. |
| 7 | Success & evaluation | What does "working" mean in production from the user's perspective? What's measurable, what's a proxy, what's left to judgment? |
See references/blocks/01-why-ai.md through references/blocks/07-success-evaluation.md
for each block's deep probe questions, acceptance criteria, and common gap
patterns. The skill body uses these references lazily — read them when the
phase calls for them, not all up front.
Goal: establish context — what's being reviewed, who built it, what artifacts already exist — so the elicitation can adapt.
Look for these in order; first match wins:
1. `output_dir` arg → use as-is.
2. `STARTUP_KIT_DOCS_ROOT` env var → `${STARTUP_KIT_DOCS_ROOT}/ai-ux/`.
3. `docs/startup-kit/` exists → `docs/startup-kit/ai-ux/`. Surface: "Writing to docs/startup-kit/ai-ux/ (smart default — kit folder exists). Set STARTUP_KIT_DOCS_ROOT=./docs for standalone."
4. Otherwise → `docs/ai-ux/`.

If a prior ai-ux-review.md exists at the resolved root, this is an update-mode run; see "Update mode" below.
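
The first-match-wins order above can be sketched in Python. This is a minimal illustration, not the skill's own code: the function names `resolve_review_root` and `is_update_mode`, and the `env` and `cwd` parameters (added so the logic can be exercised without touching the real environment), are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_review_root(
    output_dir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cwd: Path = Path("."),
) -> Path:
    """Return the review folder; the first matching rule wins."""
    env = os.environ if env is None else env

    # 1. Explicit output_dir argument: use as-is.
    if output_dir:
        return Path(output_dir)

    # 2. STARTUP_KIT_DOCS_ROOT env var: append ai-ux/.
    docs_root = env.get("STARTUP_KIT_DOCS_ROOT")
    if docs_root:
        return Path(docs_root) / "ai-ux"

    # 3. Smart default: an existing docs/startup-kit/ folder.
    if (cwd / "docs" / "startup-kit").is_dir():
        return cwd / "docs" / "startup-kit" / "ai-ux"

    # 4. Solo default.
    return cwd / "docs" / "ai-ux"


def is_update_mode(review_root: Path) -> bool:
    """A prior ai-ux-review.md at the resolved root means an update-mode run."""
    return (review_root / "ai-ux-review.md").is_file()
```

Calling `resolve_review_root()` with no arguments in a repo that has neither the env var nor a `docs/startup-kit/` folder falls through to `docs/ai-ux`.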
Before asking questions, check for files this skill can read as context:

- `<resolved-canvas-root>/validation-canvas.md`: if present, the business model is settled. Read it. Block 1's probes shift from "what's the user task" (discovery) to "is your stated UVP defensible without the AI?" (pressure test).
- `<resolved-brand-root>/DESIGN.md`: if present, use brand tokens for the HTML output. Extract `colors.primary` from YAML front matter per the Google Labs spec.

Ask only the ones not already obvious from context:
Before proceeding to block work, mirror back the framing so the builder can correct you:
"Reviewing an LLM-powered email-draft feature, currently in development. Validation canvas exists — I'll skip business-model questions and pressure-test the UX execution. Stop me if I should reframe."
Then proceed to Phase 1.
Do not re-ask business-model questions when validation-canvas is present. Read it instead. If the canvas is silent on a relevant fact, surface that as a gap in the canvas, not as a question for this skill.

Goal: Walk the seven blocks in order, eliciting design decisions per block. For each block, the output is either a concrete answer (the design decision) or an explicit gap marker (`[Gap — what hasn't been designed yet]`).
Run the interview in first-person voice with these roles active. If
team-composer is invoked separately, skip duplicates.
| Role | Lens |
|---|---|
| @ux_researcher | Lead interviewer. Asks block-by-block. Pushes for specificity. |
| @ai_safety_specialist | Trust calibration, output integrity, autonomy levels (blocks 3, 6). Flags responsible-AI gaps. |
| @lead_behavioral_scientist | Mental model and trust dynamics (blocks 2, 3). Flags assumptions about user cognition. |
| @senior_product_designer | Feedback affordances, error UI, recovery paths (blocks 4, 5). Flags UI patterns that won't survive contact with users. |
| @senior_product_manager | Success criteria and measurability (block 7). Flags "metrics" that are actually aspirations. |
The order matters — earlier blocks set up the constraints later blocks inherit. Do not jump around.
For each block:
- Read `references/blocks/<NN>-<name>.md`.
- Where the builder has no answer, record `[Gap — <what's missing>]`. Gaps are data, not failure; they roll up into Phase 2's Gap Summary.

Each block has at least one specific, testable design decision OR an explicit gap marker. "We have trust calibration" is not enough; "When confidence is below 0.7, the chip color shifts to amber and we add 'verify before using' text" is enough. "We haven't decided how to surface confidence yet" is also enough: it's a gap, not a missing answer.

If the builder doesn't know, mark the block `[Gap — <what hasn't been designed>: <why it matters>]`. A review with honest gaps is more useful than a review with invented confidence.
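
Because the gap marker has a fixed shape, the roll-up into the Gap Summary can be done mechanically. The sketch below assumes the review Markdown uses the `## Block N — ...` headings and the `[Gap — what: why]` marker exactly as written in this document; `collect_gaps` and the dict keys it returns are hypothetical names for illustration.

```python
import re
from typing import Dict, List, Optional

# Matches "[Gap — what]" and "[Gap — what: why]".
GAP_PATTERN = re.compile(r"\[Gap — ([^\]:]+)(?:: ([^\]]+))?\]")
BLOCK_HEADING = re.compile(r"## Block (\d) — ")


def collect_gaps(markdown_text: str) -> List[Dict[str, object]]:
    """Return every gap marker with the block number it appears under."""
    gaps: List[Dict[str, object]] = []
    block: Optional[int] = None
    for line in markdown_text.splitlines():
        heading = BLOCK_HEADING.match(line)
        if heading:
            block = int(heading.group(1))
        for marker in GAP_PATTERN.finditer(line):
            gaps.append(
                {
                    "block": block,
                    "what": marker.group(1).strip(),
                    "why": (marker.group(2) or "").strip(),
                }
            )
    return gaps
```

A marker without the `: <why it matters>` part still parses; its `why` field comes back empty, which is itself a useful signal that the gap needs a rationale.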
The full seven-block walk takes 30–60 minutes depending on product complexity. Single-block work (the user only wants to review one block) is ~10 minutes — but still invoke the skill, since the surrounding blocks stress-test the one being reviewed.
Goal: Surface contradictions, dependencies, and gaps that only show up when the blocks are read together — the cross-block pressure test.
Apply these six checks. They are mandatory.
1. Why AI ↔ Output integrity: If Block 1 said "AI is necessary because X requires generative output," does Block 6 actually design for the integrity risks of generative output? Necessity claims that don't take on the cost of generation are red flags.
2. Mental model ↔ Trust calibration: Is the trust calibration in Block 3 aligned with the mental model in Block 2? A system that surfaces confidence scores has implicitly committed users to a probabilistic mental model. Does Block 2 actually say that?
3. Trust calibration ↔ Feedback & control: If Block 3 says users will sometimes need to verify, does Block 4 actually give them the affordances to verify? Cited sources without click-to-source is a flag.
4. Errors ↔ Feedback & control: Every failure mode in Block 5 should have a corresponding recovery affordance in Block 4. Failure modes without recovery paths are gaps.
5. Output integrity ↔ Success & evaluation: Block 6's mitigations (hallucination guards, citation requirements, agent autonomy limits) should show up as measurable signal in Block 7. A mitigation with no measurement plan is theater.
6. All blocks ↔ Lifecycle stage: A shipped product with [Gap] markers in Blocks 3, 5, or 6 is a different urgency than the same gaps in a prototype. Tag the urgency of each gap based on lifecycle.
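
Two of these checks are mechanical enough to sketch in code: the Errors ↔ Feedback & control pairing and the lifecycle urgency tag. The data shapes here (a list of failure-mode names, a dict mapping each to its recovery affordance) and the `"high"` / `"normal"` labels are assumptions for illustration; the skill itself does not prescribe a scale.

```python
from typing import Dict, List

# Blocks whose gaps the lifecycle check singles out for a shipped product.
LIFECYCLE_SENSITIVE_BLOCKS = {3, 5, 6}


def unrecovered_failures(
    failure_modes: List[str], recovery_paths: Dict[str, str]
) -> List[str]:
    """Block 5 failure modes with no recovery affordance recorded in Block 4."""
    return [
        mode for mode in failure_modes if not recovery_paths.get(mode, "").strip()
    ]


def gap_urgency(block: int, lifecycle: str) -> str:
    """Tag a gap by lifecycle; the two-level scale is a hypothetical choice."""
    if lifecycle == "shipped" and block in LIFECYCLE_SENSITIVE_BLOCKS:
        return "high"
    return "normal"
```

Anything returned by `unrecovered_failures` would be recorded as a gap rather than silently dropped.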
Append a `## Gap Summary` section to ai-ux-review.md with the 3–5 most urgent unmade design decisions. For each, state the gap in one line, why it matters, and the cheapest design experiment to resolve it.
Gap Summary is the most-read part of the review six months later. Do not
treat it as filler. It is also the direct hand-off if a future
ai-eval-rubric companion skill ships — that skill will grep this section
to seed its eval coverage.
Goal: Produce ai-ux-review.md and ai-ux-review.html, save them to the
resolved review folder, and present them to the user.
ai-ux-review.md

Structure (headings must match exactly):
# AI UX Review — [Product / Feature Name]
> Generated on [YYYY-MM-DD]. Lifecycle: [idea | prototype | development | shipped]. AI type: [LLM | classical-ML | CV | multi-modal | agentic].
## Block 1 — Why AI Here?
- ...
## Block 2 — Mental Model
- ...
## Block 3 — Trust Calibration
- ...
## Block 4 — Feedback & Control
- ...
## Block 5 — Errors & Graceful Failure
- ...
## Block 6 — Output Integrity
- ...
## Block 7 — Success & Evaluation
- ...
---
## Gap Summary
1. **[Gap in one line]**
- Why it matters: ...
- Cheapest design experiment to resolve: ...
The heading anchors ## Block 1...## Block 7 and ## Gap Summary are
load-bearing. If a future ai-eval-rubric companion ships, it will grep
these by name. Do not rename.
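
Since downstream tooling is expected to find these headings by name, a pre-ship check can confirm none were renamed. The following is a minimal sketch under the assumption that each heading sits alone on its own line; `missing_headings` and `gap_titles` are hypothetical helper names, and the title regex assumes the numbered, bolded Gap Summary items shown in the structure above.

```python
import re
from typing import List

REQUIRED_HEADINGS = [
    "## Block 1 — Why AI Here?",
    "## Block 2 — Mental Model",
    "## Block 3 — Trust Calibration",
    "## Block 4 — Feedback & Control",
    "## Block 5 — Errors & Graceful Failure",
    "## Block 6 — Output Integrity",
    "## Block 7 — Success & Evaluation",
    "## Gap Summary",
]


def missing_headings(markdown_text: str) -> List[str]:
    """Return required headings that do not appear verbatim as a line."""
    lines = {line.rstrip() for line in markdown_text.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in lines]


def gap_titles(markdown_text: str) -> List[str]:
    """Return the one-line gap titles listed under ## Gap Summary."""
    _, found, tail = markdown_text.partition("\n## Gap Summary")
    if not found:
        return []
    return re.findall(r"^\d+\.\s+\*\*(.+?)\*\*", tail, flags=re.MULTILINE)
```

An empty list from `missing_headings` means the anchors are intact; any entry names a heading that was renamed or dropped.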
ai-ux-review.html

Read the template at templates/ai-ux-review.html and produce a single self-contained HTML file that:

- Marks each block that has a gap with a `[GAP]` chip in its header.
- Reads brand tokens from `<resolved-brand-root>/DESIGN.md` if it exists, extracting `colors.primary` from YAML front matter and binding to `--ai-ux-accent`. Otherwise uses neutral defaults.

Save both files:

- `<review-root>/ai-ux-review.md`
- `<review-root>/ai-ux-review.html`

Create the folder if absent.
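
The brand-token binding can be illustrated with a stdlib-only sketch. It assumes DESIGN.md opens with front matter fenced by `---` lines and holds a top-level `colors:` mapping with an indented, quoted `primary:` value; the actual Google Labs spec may differ, and a real implementation would more likely use a proper YAML parser. `NEUTRAL_ACCENT` is a made-up placeholder, not the skill's actual default.

```python
import re
from typing import Optional

NEUTRAL_ACCENT = "#4b5563"  # hypothetical neutral default, not from the skill


def primary_color(design_md: str) -> Optional[str]:
    """Best-effort read of colors.primary from simple YAML front matter."""
    match = re.match(r"\A---\r?\n(.*?)\r?\n---", design_md, flags=re.DOTALL)
    if not match:
        return None
    in_colors = False
    for line in match.group(1).splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if not line[0].isspace():
            # Top-level key: track whether we are inside the colors mapping.
            in_colors = stripped == "colors:"
            continue
        if in_colors and stripped.startswith("primary:"):
            value = stripped[len("primary:"):].strip()
            return value.strip("\"'") or None
    return None


def accent_css(design_md: Optional[str]) -> str:
    """Bind the brand color, or the neutral fallback, to --ai-ux-accent."""
    color = primary_color(design_md) if design_md else None
    return f":root {{ --ai-ux-accent: {color or NEUTRAL_ACCENT}; }}"
```

Passing `None` (no DESIGN.md found) yields the neutral rule, so the HTML renders either way.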
Use present_files if available; otherwise emit clickable computer://
links. Present the HTML first (visual primary), Markdown second (source of
truth).
End with three lines:
team-composer with
@ai_safety_specialist."

Every run ends this way. Do not replace with a "final deliverable" header or meta-commentary.
When ai-ux-review.md already exists at the resolved review root:

- Annotate changes with `<!-- updated YYYY-MM-DD: <reason> -->`.

This is the loop-back protocol: gaps closed over time are progress, not failure.
<review-root>/ai-ux-review.md Canonical, editable source of truth
<review-root>/ai-ux-review.html Self-contained visual review (primary deliverable)
Where <review-root> resolves per Phase 0.1:
- `docs/startup-kit/ai-ux/`: orchestrated (smart default or env-var override)
- `docs/ai-ux/`: solo default
- `${STARTUP_KIT_DOCS_ROOT}/ai-ux/`: env-var override

No other files. Do not scatter intermediate drafts.
Before presenting to the user, verify each:
Phase 0 (Intake)

- Existing context files checked (validation-canvas.md, DESIGN.md, attached spec)

Content (per block)

- Each block has a specific, testable design decision or a `[Gap — ...]` marker

Cross-block (Phase 2)

Rendering

- ai-ux-review.md uses the exact heading structure (so downstream tools can parse)
- ai-ux-review.html is a single file, opens in a browser, prints cleanly to PDF
- Brand tokens applied when `<brand-root>/DESIGN.md` is present; neutral defaults otherwise

Shipping

- Files presented via present_files or computer:// links

| Skill | When to Use |
|---|---|
| ai-eval-review (our own) | Sibling skill. Same shape, different subject: ai-ux-review audits the human-AI design surface; ai-eval-review audits the measurement layer. Block 7 (Success & Evaluation) gaps from this review seed Block 1 of ai-eval-review when both are run. Block 6 boundary is explicit: this skill names designed mitigations (verifiability, provenance, prompt-injection mitigation, autonomy); ai-eval-review Block 6 names measured signal (resistance rates, OOD detection, jailbreak eval). Run both for a complete AI product review. |
| validation-canvas (our own) | Upstream when the product is idea-stage and the business model isn't settled. This skill reads validation-canvas.md if present and skips business-model questions. |
| brand-workshop (our own) | Upstream when a brand identity exists. This skill reads <brand-root>/DESIGN.md to style the HTML output. Extracts brand tokens from YAML front matter per Google Labs spec (alpha). |
| team-composer (our own) | Alternative when the builder wants a discussion on one narrow block (e.g., "let's debate the trust-calibration approach") rather than a full review artifact. Use team-composer with @ux_researcher + @ai_safety_specialist. |
| startup-grill (our own) | Adjacent. After this skill ships, the builder may grill the resulting design adversarially before launch. Distinct deliverable (kill report with verdict) vs. this skill's gap summary. |
| pitch-deck (our own) | Downstream. If the review surfaces strong design decisions (block-by-block trust + integrity story), they can seed slides 3 and 6 of the deck. |
| riskiest-assumption-test (our own) | Composition. If the Gap Summary surfaces design assumptions that need testing (not just deciding), hand the gaps to RAT to convert into falsifiable hypotheses. |
| theme-factory (Anthropic) | When the HTML review needs branded styling and no DESIGN.md exists. Apply after content is finalized. |
| web-artifacts-builder (Anthropic) | For interactive review variants (filter by block, toggle gaps). Out of scope for v1; natural upgrade path. |
| pdf (Anthropic) | When merging the review into a larger packet. The HTML already prints cleanly to PDF; pdf is for programmatic assembly across artifacts. |
| docx (Anthropic) | When the review needs to ship as .docx (legal review, regulator submission, board packet). Hand ai-ux-review.md as source. |
| ai-safety-mindset (Anthropic) | When the team is missing shared vocabulary for responsible-AI framing. Load this skill alongside for Anthropic's HHH framing rather than ad-hoc definitions. |
Principle: this skill owns the design-completeness artifact — whether
the human side of an AI product has been intentionally designed. It does not
validate beliefs (that's validation-canvas), test hypotheses (that's
riskiest-assumption-test), pitch the product (that's pitch-deck),
adversarially probe it (that's startup-grill), or implement evaluation
pipelines (out of scope; future companion).
Graceful degradation: if a referenced skill is not installed, this skill
still ships ai-ux-review.md + ai-ux-review.html. Cross-skill chains are
enhancements, not requirements.
- references/blocks/01-why-ai.md: necessity check; AI vs. deterministic alternative; falsifiable success
- references/blocks/02-mental-model.md: user's expected model; misalignment patterns; teaching affordances
- references/blocks/03-trust-calibration.md: confidence surfacing; under/over-trust risks; explanation depth
- references/blocks/04-feedback-control.md: correction affordances; override paths; autonomy spectrum
- references/blocks/05-errors-graceful-failure.md: failure-mode catalog; severity tiers; recovery paths
- references/blocks/06-output-integrity.md: hallucination guards; provenance; verifiability; prompt-injection surface; multi-turn drift; agent autonomy
- references/blocks/07-success-evaluation.md: production success criteria; measurable signal vs. proxy; eval gaps

Read these when the phase calls for them. Do not front-load all references at once; that's the progressive disclosure pattern this repo uses (see CLAUDE.md → "Harness vocabulary").
Tags: ai, ux, human-centered-ai, design-review, llm, responsible-ai
npx claudepluginhub sorawit-w/agent-skills --plugin agent-skills