From uxaudit
Runs UX regression testing on running apps: static checks for AI slop, accessibility, Nielsen heuristics; dynamic pixel-only user journey validation. Outputs dashboard of UX issues and UI risk signals with suggestions. Use after building or UI changes.
Slash command: `/uxaudit:uxaudit`
A skill for running UX regression testing on a running app through two reader-facing buckets: `UX issues` and `UI risk signals`. The dashboard is meant to be read through those two top-level buckets first; each check inside the `ux` band is also tagged with a `subtype` (`understand` / `decide` / `act` / `recover`) so reviewers can drill into where in the user's flow the failure happens. Every judgment-call check is anchored in published, publicly-verifiable research (NN/g, WCAG 2.2, ISO 9241-11/110, Krug, Norman, Cooper, Christensen, peer-reviewed studies). Self-evident anti-patterns (lorem ipsum, "Coming soon" stubs) ship without an academic anchor because they aren't judgment calls. The full allow-list and the "don't invent" rule live in references/knowledge/sources.md. Architecture, subagent isolation, and the L1→L4 evaluation model live in references/concepts/architecture.md.
Most testing tells you whether code works. uxaudit tells you whether the result is usable and wants-to-be-used.
Two layers run side by side: the check catalog (static checks) and the scenario layer (dynamic user-journey validation).
The catalog covers the floor of craft. The scenario layer covers journey continuity. Neither pretends to predict whether the product is the right product to build; that's a human question.
The dashboard presents findings in two reader-facing buckets: UX issues and UI risk signals.
Each check is also tagged with a category (derived from its directory layout — accessibility, ai-slop, core-experience, desirability, usability) and, when its band is ux, a subtype (understand / decide / act / recover). Reviewers should read the output through the two top-level buckets first; category and subtype are secondary axes the dashboard exposes for drill-down.
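As a rough illustration of "category derived from directory layout", the sketch below reads the category out of a catalog path shaped like `checks/<category>/<name>/check.json`. The function name `category_of` is hypothetical and the real catalog loader may work differently; only the five category names come from this document.

```python
from pathlib import PurePosixPath

# The five categories named in this document.
VALID_CATEGORIES = {"accessibility", "ai-slop", "core-experience", "desirability", "usability"}


def category_of(check_path: str) -> str:
    """Return the <category> segment of checks/<category>/<name>/check.json."""
    parts = PurePosixPath(check_path).parts
    if len(parts) < 3 or parts[0] != "checks":
        raise ValueError(f"not a catalog path: {check_path}")
    category = parts[1]
    if category not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return category


# Illustrative path only.
print(category_of("checks/ai-slop/coming-soon-text/check.json"))
```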
The reader-facing hierarchy is:
- Band: ux vs ui-risk
- Within the ux band: understand → decide → act → recover
- Within the ui-risk band: critical → major → minor

The four UX subtypes are the canonical questions a reviewer cares about:
| Subtype | Question |
|---|---|
| Understand | Can the user tell what this is and what they are looking at? |
| Decide | Can the user choose the next action without excess judgment? |
| Act | Can the user move through the task and reach value? |
| Recover | Can the user read system state and recover when needed? |
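One way to picture the reading order is as a sort key: band first, then subtype for the ux band or the critical/major/minor ordering for the ui-risk band. This is a minimal sketch, not the dashboard's actual code. The finding dictionaries and the `severity` field name are assumptions made for illustration; `band`, `subtype` and `check_id` are names used in this document.

```python
BAND_ORDER = ["ux", "ui-risk"]
SUBTYPE_ORDER = ["understand", "decide", "act", "recover"]
SEVERITY_ORDER = ["critical", "major", "minor"]  # field name "severity" is assumed


def sort_key(finding: dict) -> tuple:
    """Order findings by band, then by subtype (ux) or severity (ui-risk)."""
    band = finding["band"]
    if band == "ux":
        inner = SUBTYPE_ORDER.index(finding["subtype"])
    else:
        inner = SEVERITY_ORDER.index(finding["severity"])
    return (BAND_ORDER.index(band), inner, finding["check_id"])


findings = [
    {"check_id": "b", "band": "ui-risk", "severity": "major"},
    {"check_id": "a", "band": "ux", "subtype": "recover"},
    {"check_id": "c", "band": "ux", "subtype": "understand"},
    {"check_id": "d", "band": "ui-risk", "severity": "critical"},
]
ordered = sorted(findings, key=sort_key)
print([f["check_id"] for f in ordered])  # ['c', 'a', 'd', 'b']
```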
An older rationale doc, references/concepts/ux-layers.md, frames the same territory using the Usability / Desirability / Retention vocabulary. That naming is kept only as conceptual prose — the check catalog itself has no layer field, only band + subtype. A "Retention" layer is explicitly out of scope: a single session cannot measure whether users come back.
Frontmatter argument-hint lists the surface flags. Their meanings:
| Argument | Default when omitted | Notes |
|---|---|---|
| `<target>` | cwd directory basename | Workspace label only, used for directory naming and the dashboard header. |
| `--url URL` | Auto-detected by the Project Locator (Step 2 of the playbook) reading package.json scripts, vite/next/astro/nuxt configs, .env PORT, Procfile, and Rails/Django/Phoenix defaults. If the declared dev server isn't running, the orchestrator offers to launch it in a background shell with user consent and stops it on exit. | Mutually exclusive with --electron-app. |
| `--electron-app PATH` | (none) | Path to an Electron entry point or package dir. L4 may evaluate it if the Judge can operate the app in pixel-only mode from the same session; otherwise it must escalate or skip honestly. |
| `--viewport <preset\|WxH>` | Scout's primary_viewport from project-context.json, falling back to desktop. | See viewport preset table below. Named presets set isMobile: true in Playwright; custom WxH does not. |
| `--viewports <list>` | (single iteration with --viewport) | Comma-separated. Produces one iteration per viewport with suffixed directory names (iteration-N-mobile, …). Mutually exclusive with --viewport. |
| `--lang ja\|en` | en | When ja, the orchestrator includes Language: ja in every Scout / Journey-Compiler / Judge / Reconciler / Proposer dispatch prompt body and passes --language ja to aggregate.py. Every uxaudit-* agent handles the language directive inline; the translation contract is embedded in the agent system prompt. Machine fields stay verbatim (check_id, source, pass_criteria, rationale, file basenames, route paths); only narrative fields are translated. L1/L2 catalog detector strings are English-only. |
| `--scenario-mode locked\|refresh\|hybrid` | hybrid | Controls whether the project's scenario contract (the Scout's project-context.json, treated as an evidence-based hypothesis the human has ratified) is reused or re-synthesized between iterations. locked = pure regression run: reuse the previous iteration's contract verbatim, do not re-run the Scout, give iter-to-iter comparison maximum stability. refresh = throw the contract away and re-synthesize from scratch (use after major IA/product changes when the old contract no longer represents the product). hybrid = keep the existing contract entries stable but let the Scout add new candidate journeys on top (the default; preserves comparability while letting coverage grow). See references/concepts/discovery-and-scope.md Part 0 for the framing rationale. |
| `--only <categories>` | (all categories) | Comma-separated category filter. Valid categories: ai-slop, accessibility, usability, core-experience, desirability. |
| `--skip <categories>` | (none) | Comma-separated category exclusion (same valid set as --only). |
| `--yes` / `-y` | (off; orchestrator pauses on disambiguation) | Unattended mode. Skip every confirmation prompt and pick safe defaults so the run completes end-to-end without human intervention. The only place the orchestrator currently pauses is Phase 01 Step 3 (dev-server disambiguation / launch consent). Defaults applied with --yes: case C (multiple running) → first auditable+running candidate, sorted by confidence then app_id; case D (one declared, none running) → auto-launch via the existing background launch procedure; case E (multiple declared, none running) → first declared, then auto-launch. Cases F (zero declared) and "native platform without --url" still abort: there is nothing safe to assume. The orchestrator must log the auto-pick ([uxaudit] --yes: auto-picked '<app_id>' (...)) so the chosen target is visible in the run log. |
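The `--yes` defaults for cases C through F can be read as a small decision function. The sketch below is an illustration under stated assumptions, not the orchestrator's implementation. The `Candidate` shape and `auto_pick` name are hypothetical, "sorted by confidence" is assumed to mean highest confidence first, and a single running candidate is treated the same way as case C.

```python
from dataclasses import dataclass


@dataclass
class Candidate:  # hypothetical shape of one Project Locator result
    app_id: str
    confidence: float
    auditable: bool
    running: bool


def auto_pick(candidates: list) -> tuple:
    """Return (app_id, needs_launch) for --yes, or raise when nothing is safe to assume."""
    running = [c for c in candidates if c.auditable and c.running]
    if running:
        # Case C: first auditable+running candidate by confidence (assumed descending), then app_id.
        best = sorted(running, key=lambda c: (-c.confidence, c.app_id))[0]
        return best.app_id, False
    if candidates:
        # Cases D and E: take the first declared candidate, then auto-launch it.
        return candidates[0].app_id, True
    # Case F: zero declared candidates, so the run aborts.
    raise RuntimeError("no dev server declared; nothing safe to assume")


print(auto_pick([
    Candidate("web", 0.6, True, True),
    Candidate("admin", 0.9, True, True),
]))  # ('admin', False)
```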
The iteration metadata records the actual scenario contract mode used:
- scenario_mode
- scenario_source_iteration
- comparable_to_previous

This is what lets History mode distinguish "same measuring stick, new fix attempt" from "the measuring stick itself changed."
Native macOS/iOS/Android apps and CLIs are out of scope: uxaudit is screenshot- and DOM-driven, so a target without a rendered DOM cannot be audited.
Alternate entry points the orchestrator may be asked to run directly (not slash invocations — these are post-audit views over an existing workspace):
python $UXAUDIT_DIR/scripts/generate_dashboard.py <workspace> --timeline
python $UXAUDIT_DIR/scripts/generate_dashboard.py <ws-a> <ws-b> --compare
python $UXAUDIT_DIR/scripts/generate_dashboard.py --library
$UXAUDIT_DIR is exported by orchestrator playbook Step 0 in plugin / --plugin-dir / standalone-symlink / dev-checkout modes — the resolution chain in references/playbook/00-env-setup.md walks $CLAUDE_SKILL_DIR → $CLAUDE_PLUGIN_ROOT/skills/uxaudit → $HOME/.claude/skills/uxaudit → common dev paths and verifies each candidate by looking for SKILL.md. The dashboard auto-detects mode (audit / history / compare / library) from inputs. See references/modes/library.md for the central registry, archiving, and .uxauditignore.
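The resolution chain can be sketched as "try each candidate in order and accept the first one that contains SKILL.md". The playbook's actual logic lives in references/playbook/00-env-setup.md and may differ. This Python version is only an illustration, the function name is hypothetical, and the unspecified "common dev paths" are deliberately left out.

```python
import os
import tempfile
from pathlib import Path


def resolve_uxaudit_dir(env=None):
    """Return the first candidate directory that contains SKILL.md, else None."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("CLAUDE_SKILL_DIR"):
        candidates.append(Path(env["CLAUDE_SKILL_DIR"]))
    if env.get("CLAUDE_PLUGIN_ROOT"):
        candidates.append(Path(env["CLAUDE_PLUGIN_ROOT"]) / "skills" / "uxaudit")
    if env.get("HOME"):
        candidates.append(Path(env["HOME"]) / ".claude" / "skills" / "uxaudit")
    for candidate in candidates:
        # Verification step: a candidate only counts if SKILL.md is present.
        if (candidate / "SKILL.md").is_file():
            return candidate
    return None


# Demo against a throwaway directory standing in for a plugin root.
with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp) / "skills" / "uxaudit"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text("# uxaudit\n")
    fake_env = {"CLAUDE_SKILL_DIR": str(Path(tmp) / "missing"), "CLAUDE_PLUGIN_ROOT": tmp}
    print(resolve_uxaudit_dir(fake_env) == skill_dir)  # True
```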
| Preset | CSS size | DSF | UA | Typical device |
|---|---|---|---|---|
| `mobile-sm` | 375 × 667 | 2 | iPhone | iPhone SE (catches designs that break at small widths) |
| `mobile` | 390 × 844 | 3 | iPhone | iPhone 13 / 14 / 15 / 16 (default mobile) |
| `tablet` | 820 × 1180 | 2 | iPad | iPad (10th gen), iPad Air |
| `tablet-pro` | 1024 × 1366 | 2 | iPad | iPad Pro 12.9" (catches designs that treat tablets as desktop) |
| `desktop` | 1440 × 900 | 1 | Mac | MacBook Air / Pro 13" (default) |
| `desktop-lg` | 1920 × 1080 | 1 | Mac | 1080p external monitor (catches max-width: 1440 bugs) |
Named presets that claim to be "mobile" (mobile-sm, mobile, tablet, tablet-pro) set isMobile: true in Playwright. Without this the browser ignores <meta viewport> and lays out at 980 CSS px even inside a 390-wide context, so media queries like @media (max-width: 640px) silently never fire. Custom WxH values do not set isMobile.
For any device that isn't in the preset list, use --viewport WIDTHxHEIGHT directly (e.g. --viewport 430x932 for iPhone Pro Max, --viewport 744x1133 for iPad Mini). Custom sizes use DSF=1 and a desktop UA — they're a Responsive-Mode equivalent, not a device simulator.
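The preset table and the custom-size rule can be condensed into one lookup. This is a sketch rather than the code in capture.mjs. The function name is hypothetical, the user-agent column is omitted, and the returned keys use Playwright's Python option names (`viewport`, `device_scale_factor`, `is_mobile`), which correspond to `viewport`, `deviceScaleFactor` and `isMobile` in the JavaScript API.

```python
import re

# (width, height, device scale factor, isMobile) taken from the preset table.
PRESETS = {
    "mobile-sm": (375, 667, 2, True),
    "mobile": (390, 844, 3, True),
    "tablet": (820, 1180, 2, True),
    "tablet-pro": (1024, 1366, 2, True),
    "desktop": (1440, 900, 1, False),
    "desktop-lg": (1920, 1080, 1, False),
}


def resolve_viewport(value: str) -> dict:
    """Turn a --viewport value (preset name or WxH) into browser-context options."""
    if value in PRESETS:
        width, height, dsf, is_mobile = PRESETS[value]
    else:
        match = re.fullmatch(r"(\d+)x(\d+)", value)
        if match is None:
            raise ValueError(f"expected a preset name or WIDTHxHEIGHT, got {value!r}")
        width, height = int(match.group(1)), int(match.group(2))
        dsf, is_mobile = 1, False  # custom sizes: DSF=1, desktop-style layout
    return {
        "viewport": {"width": width, "height": height},
        "device_scale_factor": dsf,
        "is_mobile": is_mobile,
    }


print(resolve_viewport("mobile"))
print(resolve_viewport("430x932"))
```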
4K monitors are intentionally NOT a preset: modern OS scaling (macOS Retina, Windows 150 %+) renders 4K displays at ≤ 1920 × 1080 CSS px, so a physical-4K preset would duplicate desktop-lg.
The orchestrator follows the literal step-by-step in references/playbook/index.md (which routes to the per-phase files 00-env-setup.md … 06-aggregate-report.md). Read the index before running. The high-level shape:
| Step | What | Run by |
|---|---|---|
| 0 | Environment setup (plugin vs standalone, NODE_PATH, UXAUDIT_LANG) | Bash |
| 1 | Workspace setup + stale dev-server cleanup | Bash |
| 2 | Project locator — detect monorepo, enumerate dev server candidates (per sub-app), probe running state | Subagent (agents/uxaudit-locator.md, sonnet) |
| 3 | URL resolution — use --url if given, else pick a running candidate, else offer to launch in background with user consent. With --yes / -y (unattended mode): all consent / disambiguation prompts are skipped and the orchestrator picks safe defaults instead — see references/playbook/01-discovery.md for the per-case decision tree. | Bash + background shell |
| 4 | Scout — discover purpose, design system, journeys, viewport → project-context.json | Subagent (agents/uxaudit-scout.md, opus) |
| 5 | Resolve viewport (Scout's primary_viewport, abort on native warning) | Bash |
| 6 | Capture target | capture.mjs |
| 7 | Catalog L1/L2 | run_all_checks.py |
| 8 | Build compressed evaluation briefs from project-context.json | generate_evaluation_briefs.py |
| 9 | Journey Compiler — translate Scout's natural-language journeys into executable journey scripts (<iter-dir>/journey-scripts/*.json). One subagent dispatch covers all journeys. | Subagent (agents/uxaudit-journey-compiler.md, sonnet) |
| 10 | L4 capture — fan capture_journey.mjs out across journey-scripts/*.json at 4-way parallelism. Per-journey evidence/{01..99}-*.png + steps.json are written. Verdicts are NOT decided here. | run_all_checks.py (integrated) |
| 11 | L3 + L4 Judge dispatch (NO project context allowed) — L3 reads the static screenshot, L4 reads ONLY the per-journey evidence/ (no browser, no curl, no journey-scripts/). All Judges batched in a single parallel Agent message. | Subagents (agents/uxaudit-l3-judge.md, agents/uxaudit-l4-judge.md, opus) |
| 12 | Completeness gate — verify_completeness.py blocks until every L3/L4 has a real verdict (recognizes the new awaiting evidence-only judge dispatch sentinel) | verify_completeness.py |
| 13 | Result schema gate — validate_results.py blocks on any malformed result.json | validate_results.py |
| 14 | Reconciler (optional) — narrow override of catalog fails when design system justifies | Subagent (agents/uxaudit-reconciler.md, sonnet) |
| 15 | Proposer — cross-check synthesis: cluster fails by root cause, produce ranked fix plan (improvement-proposal.json). First role allowed to see BOTH verdicts and project context. | Subagent (agents/uxaudit-proposer.md, sonnet) |
| 16 | Aggregate → benchmark.json (embeds proposal) | aggregate.py |
| 17 | Dashboard (renders Top fixes section from proposal) | generate_dashboard.py |
| 18 | Report verdict + top 3 fixes + stop any dev server we started | Orchestrator |
Critical isolation rule: L3/L4 Judge dispatches in Step 11 are physically sandboxed via their agent frontmatter tool allowlists — uxaudit-l3-judge has Read, Write only, uxaudit-l4-judge has Read, Write, Glob only. They physically cannot read project-context.json, the spec, the compiled journey-scripts/, or any project file because they have no Bash, no WebFetch, no Grep. The L4 split (Step 9 Compiler + Step 10 capture + Step 11 evidence-only Judge) is what makes "pixel-only at judgment time" mechanically enforceable. The Reconciler in Step 14 is the only place where intent and observation meet. See references/concepts/architecture.md for the rationale and references/subagents/judge-output-format.md for the canonical Judge output contract (kept as a reference document for check authors — the new uxaudit-l3-judge / uxaudit-l4-judge agents embed the contract inline in their system prompts).
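A check author who wants to confirm the allowlists above could parse an agent file's frontmatter and compare it against the expected set. The sketch assumes the frontmatter carries a single comma-separated `tools:` line; the helper names and the sample agent text are made up for illustration.

```python
FORBIDDEN_FOR_JUDGES = {"Bash", "WebFetch", "Grep"}


def parse_tools(agent_markdown: str) -> set:
    """Extract the tools from a leading frontmatter block delimited by '---' lines."""
    lines = agent_markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, _, value = line.partition(":")
        if key.strip() == "tools":
            return {tool.strip() for tool in value.split(",") if tool.strip()}
    return set()


def judge_is_sandboxed(agent_markdown: str, allowed: set) -> bool:
    """True when the declared tools stay inside `allowed` and avoid the forbidden set."""
    tools = parse_tools(agent_markdown)
    return bool(tools) and tools <= allowed and not (tools & FORBIDDEN_FOR_JUDGES)


# Hypothetical agent file content, not copied from the repository.
l3_judge = "---\nname: uxaudit-l3-judge\ntools: Read, Write\nmodel: opus\n---\nPrompt body\n"
print(judge_is_sandboxed(l3_judge, {"Read", "Write"}))  # True
```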
/uxaudit is autonomous from L1 through L4. Step 12 blocks the pipeline until every Judge has actually run; Step 13 blocks if any verdict is malformed.
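The two gates can be pictured as one pass over the collected results: a check without a real verdict is incomplete (Step 12), and a fail without a suggestion is malformed (Step 13). This is a simplified sketch, not verify_completeness.py or validate_results.py. The `verdict` and `suggestion` field names, the pass/fail verdict values, and the way the sentinel is stored are all assumptions.

```python
REAL_VERDICTS = {"pass", "fail"}  # assumed verdict vocabulary


def gate(results: dict) -> dict:
    """Split check ids into those still awaiting a verdict and those with malformed results."""
    incomplete, malformed = [], []
    for check_id, result in sorted(results.items()):
        verdict = result.get("verdict")
        if verdict not in REAL_VERDICTS:
            incomplete.append(check_id)  # includes sentinel placeholders
        elif verdict == "fail" and not result.get("suggestion"):
            malformed.append(check_id)  # a fail must carry a suggestion
    return {"incomplete": incomplete, "malformed": malformed}


report = gate({
    "l3-hierarchy": {"verdict": "pass"},
    "l4-checkout": {"verdict": "awaiting evidence-only judge dispatch"},
    "l3-contrast": {"verdict": "fail"},
    "l4-signup": {"verdict": "fail", "suggestion": "Show the error next to the field."},
})
print(report)  # {'incomplete': ['l4-checkout'], 'malformed': ['l3-contrast']}
```

A pipeline built on this would stop when either list is non-empty.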
L4's pixel-only guarantee at judgment time is enforced by the capture/judge split:
1. The Journey Compiler (Step 9) translates the Scout's natural-language journeys into executable journey scripts.
2. The capture step (capture_journey.mjs) executes those scripts and writes screenshots + steps.json. It commits no judgment.
3. The L4 Judge (Step 11) reads only the per-journey evidence/ and never sees journey-scripts/. Verdicts come from what is visible in the PNGs, with steps.json used only to disambiguate ordering and to know whether a screenshot represents a successful step or a captured failure state.

Reading order when authoring a new check:

1. references/authoring/check-authoring.md (contract): when to add a check, the 10 manifest fields, classification (band/subtype) and method (L1–L4) choice, evidence-file policy, common mistakes, end-to-end checklist
2. schemas/check-manifest.schema.json: formal JSON Schema enforced by the pre-flight validator
3. references/authoring/adding-checks.md: L1 / L2 / L3 / L4 worked walkthroughs with the actual check_lib / _target.mjs API
4. scripts/validate_checks.py + scripts/test_checks.py: run after every edit

python scripts/validate_checks.py # pre-flight: validate all check.json
python scripts/test_checks.py # regression: L1 fixtures
python scripts/validate_results.py <iter> # post-flight: every result.json
Three safeguards work together to prevent silent quality drift as the catalog grows:
| Safeguard | Catches | When it runs |
|---|---|---|
| `validate_checks.py` | Manifest drift (typo, wrong enum, missing implementation file, id↔path mismatch) | Pre-flight gate inside run_all_checks.py |
| `test_checks.py` | Logic drift in detect.py (threshold change silently breaks detection) | After every implementation edit; per-check fixtures under checks/<cat>/<name>/fixtures/{pass,fail}/ |
| `validate_results.py` | Result drift after Judge dispatch (malformed result.json, missing suggestion on a fail) | Post-flight gate at Step 13 |
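The fixture idea behind test_checks.py can be shown with a toy detector: files under fixtures/pass/ must not be flagged and files under fixtures/fail/ must be. Everything below is a stand-in. The `detect` function is a hypothetical "Coming soon" detector, not the catalog's detect.py, and the real check_lib API is not reproduced here.

```python
import tempfile
from pathlib import Path


def detect(visible_text: str) -> bool:
    """Hypothetical detector: True means placeholder text was found (the check fails)."""
    return "coming soon" in visible_text.lower()


def run_fixtures(check_dir: Path) -> list:
    """Return the fixtures whose detection result disagrees with their folder."""
    mismatches = []
    for bucket, expect_flagged in (("pass", False), ("fail", True)):
        for fixture in sorted((check_dir / "fixtures" / bucket).glob("*.txt")):
            if detect(fixture.read_text()) != expect_flagged:
                mismatches.append(f"{bucket}/{fixture.name}")
    return mismatches


# Build a throwaway check directory with one fixture per bucket.
with tempfile.TemporaryDirectory() as tmp:
    check_dir = Path(tmp) / "checks" / "ai-slop" / "coming-soon-text"
    samples = (("pass", "Pricing\nStart free trial"), ("fail", "Analytics\nComing soon!"))
    for bucket, text in samples:
        folder = check_dir / "fixtures" / bucket
        folder.mkdir(parents=True)
        (folder / "visible-text.txt").write_text(text)
    print(run_fixtures(check_dir))  # [] when every fixture behaves as labelled
```

If a threshold or pattern change made the detector stop firing, the fail fixture would show up in the returned list.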
For L3/L4 specifically, every Judge subagent must produce result.json conforming to references/subagents/judge-output-format.md (the canonical contract for check authors). The uxaudit-l3-judge and uxaudit-l4-judge agent definitions embed the contract inline in their system prompts — authors of new L3/L4 checks read the reference document to understand the contract, then write a check-specific prompt.md that only enumerates the finding-type tags and the per-axis rubric. Keep evidence assets minimal: L3 usually attaches zero or one supporting screenshot/crop when it materially helps the reviewer; L4 attaches ordered journey waypoints.
Runtime references — the orchestrator loads these on-demand during a run:
| Reading | When to load |
|---|---|
| `references/playbook/index.md` | Read first when running an audit: phase map + isolation rule + failure modes. Then load only the phase file for your current phase. |
| `references/playbook/00-env-setup.md` … `06-aggregate-report.md` | One per phase. Load only the file for the phase you are executing. |
| `references/knowledge/category-baselines.md` | Scout on-demand when discovery sources D1–D4 are thin (category-baseline fallback) |
| `references/modes/feature-scope.md` | Only when --scope feature is set |
| `references/modes/library.md` | Only for --library / --archive / --unarchive post-audit views |
Subagent definitions — Claude Code loads these automatically when the orchestrator dispatches subagent_type="uxaudit-<role>". The orchestrator does NOT re-inject them into prompt bodies:
| Agent file | Role | Model | Tools | Invoked at |
|---|---|---|---|---|
| `agents/uxaudit-locator.md` | Project locator (monorepo / dev-server enumeration) | sonnet | Read, Glob, Grep, Bash, Write | Phase 01 Step 2 |
| `agents/uxaudit-scout.md` | Project-context discovery | opus | Read, Glob, Grep, Write, WebSearch, WebFetch | Phase 01 Step 4 |
| `agents/uxaudit-journey-compiler.md` | NL journey → executable JSON | sonnet | Read, Write, Bash, WebFetch | Phase 03 Step 9 + 9.6 heal |
| `agents/uxaudit-l3-judge.md` | L3-vision Judge | opus | Read, Write | Phase 04 Step 11 |
| `agents/uxaudit-l4-judge.md` | L4-journey Judge (evidence-only) | opus | Read, Write, Glob | Phase 04 Step 11 |
| `agents/uxaudit-reconciler.md` | Catalog-vs-intent override | sonnet | Read, Edit, Write, Glob | Phase 05 Step 14 |
| `agents/uxaudit-proposer.md` | Root-cause synthesis + ranked fix plan | sonnet | Read, Write, Glob | Phase 05 Step 15 |
Subagent reference documents — read by check authors or for understanding contracts; NOT injected at dispatch time:
| Reading | Purpose |
|---|---|
| `references/subagents/judge-output-format.md` | Canonical L3/L4 result.json contract (embedded inline in uxaudit-l3-judge / uxaudit-l4-judge). Read this when authoring a new L3/L4 check's prompt.md. |
| `references/subagents/language-preamble.md` | Translation contract for --lang ja (embedded inline in every uxaudit-* agent). Kept as a reference for the aggregate scripts that still key off its enumeration of translated vs verbatim fields. |
| `references/subagents/heal-preamble.md` | Heal-request contract for validator-triggered re-dispatch (embedded inline in every uxaudit-* agent's "If re-dispatched" section). |
Contributor references — human-only documentation, NOT loaded at runtime. Do not cite from subagent briefs or playbook phase files:
| Reading | Purpose |
|---|---|
| `references/concepts/architecture.md` | Pipeline diagram, Scout/Judge/Reconciler isolation rationale, L1–L4 model, failure modes |
| `references/concepts/discovery-and-scope.md` | Journey discovery, whole-app vs feature-scope, first-time vs returning context |
| `references/concepts/experience-model.md` | Project-shaped UX contract: jobs, first value, key decisions, recovery expectations |
| `references/concepts/ux-layers.md` | Philosophy: what uxaudit measures (Usability + Desirability) and what it does not (Retention) |
| `references/knowledge/sources.md` | The "don't invent" rule + the allow-list of citation sources every judgment-call check is anchored in |
| `references/authoring/check-authoring.md` | Contract + checklist for adding new checks |
| `references/authoring/adding-checks.md` | L1 / L2 / L3 / L4 walkthroughs with the real API |
Schemas + scripts — machine artifacts, not prose:
| Artifact | Purpose |
|---|---|
| `schemas/project-context.schema.json` | Scout output schema (journeys, jobs, experience_model) |
| `schemas/check-manifest.schema.json` | check.json manifest schema, enforced by validate_checks.py |
| `schemas/check-result.schema.json` | result.json schema, enforced by validate_results.py |
| `schemas/improvement-proposal.schema.json` | Proposer output schema |
| `scripts/validate_project_context.py` | Validate Scout output (schema + linkage + coverage warnings) |
| `scripts/generate_evaluation_briefs.py` | Derive compressed Judge inputs from project-context.json |
npx claudepluginhub gotalab/uxaudit --plugin uxaudit