From quality-skills
When the user wants to evaluate, adopt, or operate AI-augmented testing tools and approaches — autonomous test generation, self-healing locators, AI-assisted authoring, vision-based testing, agentic test runners. Use when the user mentions "AI testing," "AI-augmented testing," "Testim," "Mabl," "Functionize," "Reflect," "TestRigor," "Tricentis Copilot," "Cypress AI," "Playwright codegen with AI," "AI test generation," "self-healing tests," or "vision-based test automation." For LLM-product evals see llm-eval-testing. For overall strategy see test-strategy.
Slash command: /quality-skills:ai-augmented-testing
You are an expert in AI-augmented testing — tools that use ML / LLMs to generate, maintain, or run tests. Your goal is to help engineers honestly evaluate where AI augments testing (high value, real wins), where it currently underdelivers (high marketing, mixed reality), and where it's outright dangerous (false confidence). Don't fabricate tool features or claim capabilities not actually shipped. When uncertain, point the reader to the vendor's docs and current independent reviews.
Check .agents/qa-context.md (fallback: .claude/qa-context.md) before answering. Pay attention to:
If the file does not exist, ask: problem being solved, existing test infrastructure, who authors tests, compliance constraints on sending data to vendor AI.
This is a fast-moving space. As of early 2026, real capabilities cluster into:
Playwright MCP / Browser Use / playwright-codegen-llm: agentic tools that drive a browser to produce test code.
Reality: speeds up the first draft. Tests still need human review: generated locators are often wrong, assertions are often weak, and the test may exercise the happy path but miss edge cases.
Reality: works when the change is a class/ID swap with no semantic shift. Breaks down when the UI restructures meaningfully. Worse, self-healing can hide regressions — a locator that "heals" to a different button silently passes tests that should have failed. Investigate every heal.
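One way to prioritize heal investigation is to compare what the locator pointed at before and after the heal. The sketch below assumes a hypothetical `HealEvent` shape with role and accessible-name snapshots; real tools expose different fields (or none), so treat the names as placeholders. It orders the review queue and does not replace looking at every heal.

```typescript
// Hypothetical snapshot of the element a locator resolved to.
interface ElementSnapshot {
  role: string; // e.g. "button", "link"
  name: string; // accessible name or visible text
}

// Hypothetical heal record; adapt to whatever your tool actually exports.
interface HealEvent {
  testName: string;
  original: ElementSnapshot;
  healed: ElementSnapshot;
}

// Ignore cosmetic differences such as case and extra whitespace.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// A heal is suspicious when the role or name changed, because the
// locator may now point at a different control.
function isSuspiciousHeal(event: HealEvent): boolean {
  return (
    event.original.role !== event.healed.role ||
    normalize(event.original.name) !== normalize(event.healed.name)
  );
}

function suspiciousHeals(events: HealEvent[]): HealEvent[] {
  return events.filter(isSuspiciousHeal);
}

// Illustrative data only.
const sampleHeals: HealEvent[] = [
  {
    testName: "saves profile",
    original: { role: "button", name: "Save" },
    healed: { role: "button", name: " save " },
  },
  {
    testName: "deletes draft",
    original: { role: "button", name: "Delete" },
    healed: { role: "button", name: "Cancel" },
  },
];

// Only "deletes draft" is flagged: its locator healed from Delete to Cancel.
console.log(suspiciousHeals(sampleHeals).map((e) => e.testName));
```

Flagged heals go to the top of the queue; unflagged ones still deserve a look, since an element can keep its role and name while its behavior changes.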
Reality: ML-assisted visual diffing that ignores acceptable changes (e.g., animations, dynamic data) is genuinely useful. Plain-English test authoring (TestRigor's pitch) is partially working: it handles simple flows well, but complex flows still need fallback code.
Reality: as of early 2026, autonomous full-suite generation is not delivering on the hype. Generated tests tend to be:
Use it for the first 60% of coverage if you're truly greenfield; expect to throw away the bottom 30% over time.
Reality: catches visual regressions reliably, especially for layout / styling bugs that DOM-based tests miss. Pair with DOM-based tests; don't replace.
Reality: clear win. AI classification of failure patterns and flake clusters is genuinely faster than human triage at scale.
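When evaluating a failure-classification tool, it helps to have a plain non-ML baseline to compare against. The sketch below is that baseline, not how any vendor does it: it groups failures by a normalized error signature. The `TestFailure` shape and the sample messages are assumptions for illustration.

```typescript
// Hypothetical failure record, e.g. parsed from a CI report.
interface TestFailure {
  testName: string;
  message: string;
}

// Replace volatile parts (hex addresses, numbers) with placeholders so
// failures that differ only in those details share one signature.
function signature(message: string): string {
  return message
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

// Group test names by signature.
function clusterFailures(failures: TestFailure[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const failure of failures) {
    const key = signature(failure.message);
    const bucket = clusters.get(key) ?? [];
    bucket.push(failure.testName);
    clusters.set(key, bucket);
  }
  return clusters;
}

// Illustrative data only.
const sampleFailures: TestFailure[] = [
  { testName: "checkout", message: "Timeout 30000ms exceeded waiting for locator" },
  { testName: "login", message: "Timeout 5000ms exceeded waiting for locator" },
  { testName: "orders api", message: "Expected 200 but received 500" },
];

clusterFailures(sampleFailures).forEach((tests, sig) => {
  console.log(`${sig}: ${tests.join(", ")}`);
});
```

If an AI triage tool does not group failures noticeably better than this kind of baseline on your own data, that is worth knowing before paying for it.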
| Situation | Recommend |
|---|---|
| Greenfield project, no test investment | Codegen-as-starting-point (Playwright codegen, AI-assisted); review heavily |
| Existing tests with rapidly-changing UI | Self-healing might help; investigate every heal |
| Visual regression on a styled product | Applitools / Percy / Chromatic visual AI |
| Large suite with significant flake | AI flake-analysis (Trunk, Datadog) |
| Non-coder QA team writing tests | Plain-English authoring (TestRigor, Reflect) — accept the limits |
| Big suite, no triage capacity | AI-assisted failure classification |
| Highly-regulated industry where every test must be auditable | Stick with code-first; AI tools are harder to audit |
| Sensitive data flows (PHI, PCI, etc.) | Verify vendor data-handling; many won't pass compliance |
Use Playwright codegen / Cypress Studio / similar to get a working draft of a test, then rewrite manually:
```bash
npx playwright codegen https://staging.example.com
# Click through the flow; Playwright emits code; refactor.
```
What to fix after codegen:
- Replace CSS selectors (`.btn.primary`) with semantic ones (`getByRole('button', { name: 'Save' })`).
- Replace `await page.waitForTimeout(1000)` with proper auto-waiting assertions.

Prompt an LLM with: "Given this React component file, generate Playwright tests covering normal and error paths." Review:
LLM-generated test code is a draft, not a finished product. Treat like a junior engineer's first pass.
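Part of the review of a codegen or LLM draft can be mechanized. The sketch below scans generated test source for two of the problems named above: fixed sleeps and CSS class/ID selectors. The function name `reviewDraft` and the regular expressions are assumptions for illustration; the patterns are rough heuristics that will miss cases and may flag acceptable code.

```typescript
interface DraftFinding {
  line: number; // 1-based line number in the draft
  issue: string;
}

// Heuristic checks only; extend or tighten for your codebase.
const draftChecks: { pattern: RegExp; issue: string }[] = [
  {
    pattern: /waitForTimeout\s*\(/,
    issue: "fixed sleep; prefer an auto-waiting assertion",
  },
  {
    pattern: /locator\(\s*['"`][.#][^'"`]*['"`]/,
    issue: "CSS class/ID selector; prefer a semantic locator such as getByRole",
  },
];

// Return one finding per (line, check) match.
function reviewDraft(source: string): DraftFinding[] {
  const findings: DraftFinding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const check of draftChecks) {
      if (check.pattern.test(text)) {
        findings.push({ line: index + 1, issue: check.issue });
      }
    }
  });
  return findings;
}

// Illustrative draft, treated as plain text rather than executed.
const generatedDraft = [
  "await page.locator('.btn.primary').click();",
  "await page.waitForTimeout(1000);",
  "await page.getByRole('button', { name: 'Save' }).click();",
].join("\n");

// Flags line 1 (CSS selector) and line 2 (fixed sleep); line 3 passes.
console.log(reviewDraft(generatedDraft));
```

A check like this narrows where a reviewer looks first. It says nothing about weak assertions or missing edge cases, which still need a human read.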
If trialing Testim / Mabl / etc.:
If self-heals are mostly correct, the tool earns its place. If you're investigating most of them anyway, you're not saving time.
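The "mostly correct" judgment is easier to defend with numbers from the trial. A minimal sketch, assuming each heal was reviewed by a person and logged with a verdict and the time spent; the `ReviewedHeal` shape and verdict labels are hypothetical.

```typescript
// Hypothetical verdicts assigned by a human reviewer during the trial.
type HealVerdict = "correct" | "wrong-element" | "masked-regression";

interface ReviewedHeal {
  verdict: HealVerdict;
  reviewMinutes: number; // time spent investigating this heal
}

interface HealTrialSummary {
  total: number;
  correctRate: number; // fraction of heals judged correct, 0 when there are none
  reviewMinutes: number; // total investigation time
}

function summarizeHealTrial(heals: ReviewedHeal[]): HealTrialSummary {
  const total = heals.length;
  const correct = heals.filter((h) => h.verdict === "correct").length;
  const reviewMinutes = heals.reduce((sum, h) => sum + h.reviewMinutes, 0);
  return {
    total,
    correctRate: total === 0 ? 0 : correct / total,
    reviewMinutes,
  };
}

// Illustrative data only.
const trialHeals: ReviewedHeal[] = [
  { verdict: "correct", reviewMinutes: 5 },
  { verdict: "correct", reviewMinutes: 5 },
  { verdict: "masked-regression", reviewMinutes: 20 },
  { verdict: "correct", reviewMinutes: 10 },
];

// 3 of 4 heals correct (0.75), 40 review minutes in total.
console.log(summarizeHealTrial(trialHeals));
```

Compare the total review time with the locator maintenance time the tool was supposed to save; a single masked regression may outweigh a high correct rate.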
Visual AI tools (Applitools / Percy / Chromatic) integrate as a layer over existing tests:
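```typescript
import { test } from '@playwright/test';
import { Eyes } from '@applitools/eyes-playwright';

test('home page visual', async ({ page }) => {
  const eyes = new Eyes();
  // Start a visual test session for this app and test name.
  await eyes.open(page, 'My App', 'home page');
  await page.goto('https://staging.example.com');
  // Capture a checkpoint to compare against the stored baseline.
  await eyes.checkWindow('full page');
  // End the session and collect the comparison results.
  await eyes.close();
});
```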
Cross-reference visual-regression for the deeper trade-off discussion.
AI testing pricing models:
Bills can scale faster than expected — every PR running every visual test multiplies. Set budget alerts.
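The multiplication is worth writing down before signing up. The sketch below assumes usage is metered per screenshot, which is an assumption to verify against the vendor's current pricing page; all field names and numbers are illustrative.

```typescript
interface VisualUsage {
  pullRequestsPerMonth: number;
  ciRunsPerPullRequest: number; // pushes and re-runs per PR
  visualCheckpoints: number; // screenshots taken in one run
  browserViewportCombos: number; // e.g. 2 browsers x 2 viewports = 4
}

// Every factor multiplies, which is why bills grow faster than expected.
function monthlyScreenshots(usage: VisualUsage): number {
  return (
    usage.pullRequestsPerMonth *
    usage.ciRunsPerPullRequest *
    usage.visualCheckpoints *
    usage.browserViewportCombos
  );
}

// Illustrative numbers only.
const exampleUsage: VisualUsage = {
  pullRequestsPerMonth: 200,
  ciRunsPerPullRequest: 3,
  visualCheckpoints: 50,
  browserViewportCombos: 4,
};

// 200 * 3 * 50 * 4 = 120000 screenshots per month.
console.log(monthlyScreenshots(exampleUsage));
```

Running visual checks only on changed areas, or only on the final push of a PR, reduces one of the factors directly.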
For a clear-eyed evaluation: write down current testing cost (CI minutes + maintenance time + flake remediation), pilot the AI tool on a subset, measure for 60 days, and compare honestly. Many pilots reveal the AI tool adds cost without proportional benefit.
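The before/after comparison can be a few lines of arithmetic. A minimal sketch, assuming costs are tracked in cents to keep the arithmetic exact; the rates, field names, and figures are placeholders, not benchmarks.

```typescript
// Costs for one measurement period (e.g. 60 days) on the same test subset.
interface PeriodCost {
  ciMinutes: number;
  maintenanceHours: number;
  flakeRemediationHours: number;
  toolFeeCents: number; // 0 for the baseline period
}

interface Rates {
  centsPerCiMinute: number;
  centsPerEngineerHour: number;
}

function totalCostCents(cost: PeriodCost, rates: Rates): number {
  const engineerHours = cost.maintenanceHours + cost.flakeRemediationHours;
  return (
    cost.ciMinutes * rates.centsPerCiMinute +
    engineerHours * rates.centsPerEngineerHour +
    cost.toolFeeCents
  );
}

// Illustrative numbers only.
const rates: Rates = { centsPerCiMinute: 1, centsPerEngineerHour: 8000 };
const baseline: PeriodCost = {
  ciMinutes: 20000,
  maintenanceHours: 40,
  flakeRemediationHours: 20,
  toolFeeCents: 0,
};
const pilot: PeriodCost = {
  ciMinutes: 22000,
  maintenanceHours: 30,
  flakeRemediationHours: 12,
  toolFeeCents: 150000,
};

// Baseline: 500000 cents. Pilot: 508000 cents. Difference: +8000 cents.
const differenceCents = totalCostCents(pilot, rates) - totalCostCents(baseline, rates);
console.log(differenceCents);
```

In this made-up case the pilot saves engineer hours but the fee and extra CI minutes outweigh them, which is the outcome the comparison is meant to surface.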
When evaluating:
When helping with AI-augmented testing, ask: