Skill

qa

Use when implementation is complete and you need to prove a feature or bugfix works with visual evidence. Triggers on "prove it works", "QA this", "verify with screenshots", "litmus", "show me proof", "evidence that it works", post-implementation verification requests, or any request to visually verify that code changes actually work in the real world. Also use when the user wants to generate a QA proof report with screenshots and logs.

Slash command: /litmus:qa

Supporting Files

assets/report-template/index.html
docs/2026-03-26-litmus-design.md
docs/2026-03-26-litmus-implementation-plan.md
evals/evals.json
evals/sample-data.json
prompts/executor.md
prompts/fixer.md
prompts/plan-reviewer.md
prompts/planner.md
prompts/report-reviewer.md
prompts/reporter.md
references/confidence-levels.md
references/data-schema.json
references/drivers.md
scripts/litmus.py
scripts/test_litmus.py


Last commit: Mar 27, 2026


Litmus — Proof-Driven QA Orchestrator

Iron Law

The goal is not to test. The goal is to prove.
Every scenario must produce evidence that would convince a skeptical stranger
that the implementation fully satisfies its spec and works in the real world.
"It passes" is not proof. Screenshots, logs, and behavioral evidence are proof.

Quick Start

/litmus
/litmus --env staging --url https://staging.app.com
/litmus --driver curl --url http://localhost:8080
/litmus --plan docs/qa-plan.md --spec docs/feature-spec.md
/litmus --parallelism 2

Parameters:

| Parameter | Default | Options |
| --- | --- | --- |
| `--env` | `local` | `local`, `staging`, `prod` |
| `--url` | auto-detect | `http://localhost:3000`, etc. |
| `--driver` | `agent-browser` | `agent-browser`, `agent-browser-headed`, `chrome-extension`, `playwright`, `playwright-headed`, `curl` |
| `--plan` | auto-detect | path to QA plan file |
| `--spec` | auto-detect | path to feature/bugfix spec |
| `--parallelism` | `3` | number of parallel subagents per fan-out |

Phase 1: Context Gathering (inline — do not dispatch subagents)

Answer: "What are we proving, in what environment, with what driver?"

1a. Read what changed

git diff HEAD~1 --stat
git diff HEAD~1

If there are no recent commits, check staged changes instead: git diff --cached

1b. Resolve plan/spec (in order)

1. Explicit params: --plan / --spec provided? Use them.
2. Conversation context: was a plan or spec discussed earlier this session? Use it.
3. Filesystem scan: look in docs/, .litmus/, and the project root for plan/spec files.
4. Git diff: read the changes and infer what was implemented.
5. Fallback: no plan found; planner subagents will generate one from scratch in Phase 2.
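The resolution order above behaves like a first-match-wins chain. The following is a minimal sketch of that idea, not the skill's actual logic: the function name `resolve_plan`, the `*plan*.md` glob, and the returned source labels are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# Directories named in step 1b, relative to the project root.
SCAN_DIRS = ("docs", ".litmus", ".")


def resolve_plan(
    project_dir: Path,
    explicit: Optional[str] = None,
    from_conversation: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return (source, plan) from the first source that yields a plan."""
    if explicit:
        return "explicit", explicit
    if from_conversation:
        return "conversation", from_conversation
    for name in SCAN_DIRS:
        directory = project_dir / name
        if not directory.is_dir():
            continue
        # Hypothetical convention: any Markdown file with "plan" in its name.
        matches = sorted(directory.glob("*plan*.md"))
        if matches:
            return "filesystem", str(matches[0])
    # Nothing found: the caller infers intent from the git diff,
    # or lets the Phase 2 planners start from scratch.
    return "fallback", None
```

Returning the source label alongside the plan lets the orchestrator record why a particular plan was chosen, which may be useful context for the planners.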

1c. Detect environment

Check in order:

1. package.json → look for scripts.dev to find the dev server command
2. docker-compose.yml → check exposed ports
3. .env / .env.local → look for PORT, APP_URL, BASE_URL
4. Procfile → check web process
5. Common ports → probe localhost on 3000, 5173, 8080, and 4000 (e.g., curl -s http://localhost:3000)

If ambiguous after checks, ask the user before proceeding.
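The first and third checks can be sketched as two small parsers. This is an illustrative sketch only: `dev_command` and `url_hints` are hypothetical helper names, and the `.env` parsing is deliberately simple (it ignores `export` prefixes and multi-line values).

```python
import json
from typing import Dict, Optional

# Keys named in step 1c.
ENV_KEYS = ("PORT", "APP_URL", "BASE_URL")


def dev_command(package_json_text: str) -> Optional[str]:
    """Return scripts.dev from package.json contents, if present."""
    scripts = json.loads(package_json_text).get("scripts", {})
    return scripts.get("dev")


def url_hints(env_text: str) -> Dict[str, str]:
    """Collect PORT / APP_URL / BASE_URL entries from .env-style text."""
    hints: Dict[str, str] = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines that are not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in ENV_KEYS:
            # Drop surrounding quotes, which .env files often use.
            hints[key] = value.strip().strip("'\"")
    return hints
```

If both helpers return nothing useful, the remaining checks (docker-compose.yml, Procfile, port probing) still apply before asking the user.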

1d. Resolve driver

local env → agent-browser (default)
staging/prod env → agent-browser, but recommend chrome-extension if auth walls are detected
API-only changes (no UI components in git diff) → curl
See references/drivers.md for driver capabilities and limitations.
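These rules could be expressed as a small decision function. The sketch below rests on assumptions: the function name, the file-extension heuristic for "UI components", and letting the API-only rule take precedence over the environment rules are illustrative choices, not behavior confirmed by the skill.

```python
from typing import Dict, Iterable, Optional

# Assumption: these suffixes indicate UI changes; adjust per project.
UI_SUFFIXES = (".html", ".css", ".jsx", ".tsx", ".vue", ".svelte")


def resolve_driver(
    env: str, changed_files: Iterable[str], auth_wall: bool = False
) -> Dict[str, Optional[str]]:
    """Pick a driver and, where relevant, a recommended alternative."""
    files = list(changed_files)
    touches_ui = any(name.endswith(UI_SUFFIXES) for name in files)
    if files and not touches_ui:
        # API-only change: no browser needed.
        return {"driver": "curl", "recommend": None}
    recommend = None
    if env in ("staging", "prod") and auth_wall:
        recommend = "chrome-extension"
    return {"driver": "agent-browser", "recommend": recommend}
```

An explicit `--driver` argument would bypass this function entirely.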

1e. Initialize session

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py init \
  --project-dir {project_dir} \
  --env {env} \
  --url {url} \
  --driver {driver} \
  --parallelism {parallelism} \
  --description "{short-description}"

Save the session_dir from the JSON output — every subsequent command uses it.
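Capturing that value might look like the sketch below. It assumes the init command prints a single JSON object with a top-level `session_dir` key; the exact output shape is not documented here, so treat both the key location and the helper name `extract_session_dir` as assumptions.

```python
import json


def extract_session_dir(init_stdout: str) -> str:
    """Pull session_dir out of the JSON printed by the init command."""
    payload = json.loads(init_stdout)
    # Assumption: session_dir is a top-level key of the output object.
    session_dir = payload.get("session_dir")
    if not session_dir:
        raise ValueError("init output did not contain a session_dir")
    return session_dir
```

Failing loudly when the key is missing is intentional: every later phase depends on this path.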

Phase 2: Proof Plan Generation

Dispatch a FRESH subagent (new context window) for each planner.

Fan-out: dispatch --parallelism planner subagents in parallel (default: 3).

Each subagent:

Assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/planner.md \
  --session-dir {session_dir}

Reads the assembled prompt and generates a proof plan.
Writes its plan to disk at a unique path (e.g., {session_dir}/planner-{n}-plan.md).

Orchestrator merges results:

Union all scenarios across all planner outputs.
Deduplicate overlapping scenarios (same claim, same steps).
Resolve conflicts: if planners disagree on steps, keep the more evidence-rich version.
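One way to picture the merge is the sketch below. It assumes each scenario is a dict with hypothetical `claim`, `steps`, and `evidence` fields, treats scenarios with the same normalized claim as duplicates, and measures "evidence-rich" by the number of evidence requirements. The real merge is a judgment call made by the orchestrator, so this only illustrates the rule.

```python
from typing import Any, Dict, List

Scenario = Dict[str, Any]


def merge_plans(plans: List[List[Scenario]]) -> List[Scenario]:
    """Union scenarios across planners, keeping one entry per claim."""
    merged: Dict[str, Scenario] = {}
    for plan in plans:
        for scenario in plan:
            # Normalize case and whitespace so near-identical claims collide.
            key = " ".join(str(scenario["claim"]).lower().split())
            current = merged.get(key)
            # On conflict, prefer the version demanding more evidence.
            if current is None or len(scenario["evidence"]) > len(current["evidence"]):
                merged[key] = scenario
    return list(merged.values())
```

Ties keep the first planner's version, and output order follows the order in which each claim first appeared.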

Save merged plan:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py plan save \
  --session-dir {session_dir} < merged-plan.md

Plan is saved to {session_dir}/proof-plan.md.

Phase 3: Adversarial Plan Review (up to 5 rounds)

Every round: dispatch FRESH reviewer subagents (new context window, minimum 2).

Reviewers have no memory of previous rounds — fresh eyes each time.

Each round:

Dispatch --parallelism reviewer subagents in parallel (minimum 2). Each assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/plan-reviewer.md \
  --session-dir {session_dir}

Each reviewer reads the current proof-plan.md and challenges it:
- What edge cases would slip through?
- Which claims have weak or missing evidence requirements?
- Are any scenarios untestable with the selected driver?
- If the implementation were subtly broken, would this plan catch it?
- Are there redundant scenarios wasting execution time?
Each reviewer writes findings to disk.

Orchestrator merges findings:

Union all findings across all reviewers.
Deduplicate overlapping findings (same issue, different wording).

Record the round:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py plan review \
  --session-dir {session_dir} \
  --round {N} < merged-findings.md

If findings exist:

Dispatch a FRESH planner subagent to revise the plan with the findings.
Save the revised plan via plan save.
Increment round counter. Continue.

Convergence: fresh reviewers find nothing new → proceed to Phase 4.

Escalation: after 5 rounds without convergence, surface to the user with current plan and outstanding concerns. Ask whether to proceed or abort.
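The round structure of this phase can be sketched as a bounded loop. In the sketch, `review` and `revise` are hypothetical callbacks standing in for the reviewer fan-out and the planner revision. Skipping the revision after the final round is an assumption, made so that the outstanding concerns describe the plan that is actually shown to the user.

```python
from typing import Callable, Dict, List

MAX_PLAN_REVIEW_ROUNDS = 5


def review_until_converged(
    plan: str,
    review: Callable[[str, int], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = MAX_PLAN_REVIEW_ROUNDS,
) -> Dict[str, object]:
    """Alternate review and revision until reviewers find nothing new."""
    findings: List[str] = []
    for round_number in range(1, max_rounds + 1):
        findings = review(plan, round_number)
        if not findings:
            return {"status": "converged", "rounds": round_number,
                    "plan": plan, "findings": []}
        if round_number < max_rounds:
            plan = revise(plan, findings)
    # No convergence: hand the plan and open concerns to the user.
    return {"status": "escalate", "rounds": max_rounds,
            "plan": plan, "findings": findings}
```

The same loop shape fits Phase 7 with a lower round limit.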

Phase 4: Evidence Collection

Dispatch a FRESH executor subagent (new context window) per scenario group.

Grouping strategy:

Independent scenarios (no shared state, no ordering dependencies) → fan out across --parallelism executors in parallel.
Scenarios with ordering dependencies or shared state → assign to one executor, run sequentially within that executor.
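A possible shape for this grouping is sketched below. It assumes each scenario is tagged with an optional shared-state key (a hypothetical field; `None` means independent), spreads independent scenarios round-robin across executors, and keeps scenarios that share a key together, in their original order.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_scenarios(
    scenarios: List[Tuple[str, Optional[str]]], parallelism: int = 3
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (parallel_batches, sequential_groups) of scenario ids."""
    independent = [sid for sid, key in scenarios if key is None]
    dependent: Dict[str, List[str]] = defaultdict(list)
    for sid, key in scenarios:
        if key is not None:
            dependent[key].append(sid)
    # Round-robin independent scenarios over at most `parallelism` executors.
    batches = [independent[i::parallelism] for i in range(parallelism)]
    return [batch for batch in batches if batch], list(dependent.values())
```

Each sequential group goes to a single executor, which runs its scenarios in the listed order.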

Each executor:

Assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/executor.md \
  --session-dir {session_dir}

Reads the assembled prompt and executes its assigned scenarios.
For each scenario: navigates, interacts, captures screenshots and logs, assigns a confidence level (see references/confidence-levels.md).

Saves evidence:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py evidence save \
  --session-dir {session_dir} \
  --scenario {scenario-id} \
  --step {n} \
  --description "{description}" \
  --file {path/to/file}

Writes a structured result JSON to {session_dir}/evidence/{scenario-id}/result.json.

Corroborating evidence principle: A screenshot proves the UI looked right. A screenshot plus server logs plus network logs proves the feature works end-to-end. Always ask: "What additional logs would strengthen this proof?"

After all executors complete: collect all result.json files. Any scenario with confidence below PROVEN enters the fix loop.
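Collecting the results might look like the sketch below. The directory layout follows the path given above; the top-level `confidence` field in result.json is an assumption, as is the helper name `scenarios_needing_fixes`.

```python
import json
from pathlib import Path
from typing import List


def scenarios_needing_fixes(session_dir: Path) -> List[str]:
    """List scenario ids whose result.json is not PROVEN."""
    failing = []
    for result_path in sorted((session_dir / "evidence").glob("*/result.json")):
        result = json.loads(result_path.read_text(encoding="utf-8"))
        # Assumption: each result carries a top-level "confidence" string.
        if result.get("confidence") != "PROVEN":
            # The parent directory name is the scenario id.
            failing.append(result_path.parent.name)
    return failing
```

A result with a missing confidence field counts as failing here, which errs on the side of re-checking.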

Phase 5: Fix + Re-prove Loop (up to 8 rounds)

Run this phase only if Phase 4 produced scenarios below PROVEN.

Each round:

5a. Triage failures

Orchestrator categorizes each failing scenario:

Code bug → dispatch fixer
Environment issue → dispatch fixer with environment context
Plan issue → update plan, mark as not a code failure, skip
Driver limitation → mark UNVERIFIABLE, document why, remove from failure set

5b. Fan-out fixers

Dispatch FRESH fixer subagents (new context window).

Independent bugs (different files/areas) → fix in parallel.
Bugs that touch the same file → fix sequentially.
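The parallel-versus-sequential split amounts to grouping bugs that are connected through shared files. Below is a sketch using a small union-find; the input shape (bug id mapped to the set of files it touches) and the function name `fixer_batches` are assumptions for illustration.

```python
from typing import Dict, List, Set


def fixer_batches(bugs: Dict[str, Set[str]]) -> List[List[str]]:
    """Group bug ids so bugs sharing any file land in the same batch."""
    parent = {bug: bug for bug in bugs}

    def find(bug: str) -> str:
        # Follow parent links to the group root, halving the path as we go.
        while parent[bug] != bug:
            parent[bug] = parent[parent[bug]]
            bug = parent[bug]
        return bug

    owner: Dict[str, str] = {}  # file path -> first bug seen touching it
    for bug, files in bugs.items():
        for path in files:
            if path in owner:
                # Shared file: merge this bug's group into the owner's group.
                parent[find(bug)] = find(owner[path])
            else:
                owner[path] = bug

    batches: Dict[str, List[str]] = {}
    for bug in bugs:
        batches.setdefault(find(bug), []).append(bug)
    return list(batches.values())
```

Separate batches can be fixed in parallel; bugs inside one batch are fixed one after another. Note that the grouping is transitive: two bugs with no file in common still share a batch if a third bug overlaps with both.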

Each fixer:

Assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/fixer.md \
  --session-dir {session_dir}

Receives: failing scenario, evidence (screenshots + logs), relevant code, spec.
Applies the fix, runs existing tests to check for regressions.
Writes a fix summary to {session_dir}/fix-history/round-{N}/fixes.diff.

5c. Fan-out re-executors

Dispatch FRESH executor subagents (new context window) for previously failing scenarios only.

Re-executors do not know what was fixed — they just collect evidence fresh. Same prompt assembly as Phase 4.

Each re-executor writes updated result.json files.

Write the current failure set:

# Write failures.json listing scenario IDs that are still failing
echo '{"failures": ["scenario-id-1", "scenario-id-2"]}' \
  > {session_dir}/fix-history/round-{N}/failures.json

5d. Check convergence

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py convergence check \
  --session-dir {session_dir}

Exit conditions (in priority order):

1. All scenarios PROVEN → exit, success
2. Failure set shrank → progress made, continue to next round
3. No progress 2 rounds in a row → stop, document surviving failures as findings
4. Regressions introduced twice → revert last fix, stop
5. Max 8 rounds reached → stop

Surviving failures become documented findings — not swept under the rug. Write them to {session_dir}/findings.md.
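The exit conditions can be illustrated with a small decision function. This is a sketch, not the logic of the convergence check command: it measures progress by the size of the failure set, takes the regression count as an input, and treats the 8-round limit as a hard cap that applies even when the failure set is still shrinking. All names and return labels are hypothetical.

```python
from typing import List, Set

MAX_FIX_ROUNDS = 8


def next_action(failure_history: List[Set[str]], regression_rounds: int = 0) -> str:
    """failure_history[0] is the Phase 4 failure set; each later entry is one fix round."""
    rounds_done = len(failure_history) - 1
    if not failure_history[-1]:
        return "success"
    if rounds_done >= MAX_FIX_ROUNDS:
        return "stop-max-rounds"
    # Count trailing rounds in which the failure set did not shrink.
    stalled = 0
    pairs = zip(reversed(failure_history[:-1]), reversed(failure_history[1:]))
    for previous, latest in pairs:
        if len(latest) < len(previous):
            break
        stalled += 1
    if rounds_done > 0 and stalled == 0:
        return "continue"  # the latest round made progress
    if stalled >= 2:
        return "stop-no-progress"
    if regression_rounds >= 2:
        return "revert-and-stop"
    return "continue"
```

Comparing sizes alone would miss a round that fixes one scenario while breaking another, so a fuller version would compare the sets themselves.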

Phase 6: Report Generation

Dispatch a SINGLE reporter subagent (new context window). Never fan-out this phase.

The reporter:

Assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/reporter.md \
  --session-dir {session_dir}

Reads all evidence, scenario results, fix history, and findings from {session_dir}.
Assembles data.json per the schema in references/data-schema.json.
Applies the weakest-link verdict rule: overall verdict = lowest confidence level across all scenarios.
Writes {session_dir}/data.json.
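The weakest-link rule reduces to taking the lowest-ranked level in the list. The sketch below assumes a ranking of PROVEN, PARTIAL, UNVERIFIABLE, DISPROVEN from strongest to weakest; the position of UNVERIFIABLE in particular is a guess, and the authoritative ordering is whatever references/confidence-levels.md defines.

```python
from typing import Iterable

# Assumed ranking, strongest first. Check references/confidence-levels.md
# for the real rubric; UNVERIFIABLE's position here is a guess.
RANKING = ("PROVEN", "PARTIAL", "UNVERIFIABLE", "DISPROVEN")


def overall_verdict(confidences: Iterable[str]) -> str:
    """Return the weakest confidence level among all scenarios."""
    levels = list(confidences)
    if not levels:
        raise ValueError("no scenarios to judge")
    # The highest index in RANKING is the weakest level.
    return max(levels, key=RANKING.index)
```

An unknown level raises `ValueError` from `RANKING.index`, which avoids silently ranking a typo as strong evidence.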

Validate:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py validate \
  --data {session_dir}/data.json \
  --schema ${CLAUDE_PLUGIN_ROOT}/skills/qa/references/data-schema.json

If validation fails, re-run the reporter to fix data.json, then validate again before proceeding.

Assemble report:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py report assemble \
  --session-dir {session_dir}

Serve report:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py serve \
  --session-dir {session_dir}

Note the URL from the JSON output for the report reviewers in Phase 7.

Phase 7: Report Review (up to 3 rounds)

Every round: dispatch FRESH reviewer subagents (new context window, minimum 2).

Each round:

Dispatch --parallelism report reviewer subagents in parallel (minimum 2). Each assembles its prompt:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
  --template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/report-reviewer.md \
  --session-dir {session_dir} \
  --var REPORT_URL={report_url}

Each reviewer evaluates the report through the proof lens:
- Does every PROVEN scenario have evidence that would convince a skeptic?
- Are any confidence levels inflated?
- Is evidence missing, mislabeled, or misleading?
- Are findings presented honestly or buried?
- Could someone unfamiliar with the codebase understand this report?
Each reviewer writes findings to disk.

Orchestrator merges findings (union, deduplicate).

If findings exist:

Dispatch a FRESH reporter subagent to revise data.json.
Re-validate and re-assemble.
Increment round counter. Continue.

Convergence: no findings from fresh reviewers → report is final.

After 3 rounds without convergence, surface outstanding concerns to the user and present the report as-is.

Stop the server when done:

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py stop \
  --session-dir {session_dir}

Present the final report URL and overall verdict to the user.

Key Principles

Fresh subagents every round. Every subagent dispatch opens a new context window. No accumulated bias, no memory of previous attempts. Disk is the only shared state.

Disk is shared state. All subagents read from and write to {session_dir}. The orchestrator coordinates by reading disk artifacts after each fan-out completes.

Burden of proof. PROVEN means a skeptical stranger would agree. When in doubt, assign PARTIAL — not PROVEN. See references/confidence-levels.md.

Weakest-link verdict. The overall verdict equals the lowest confidence level across all scenarios. One DISPROVEN scenario means the whole report is DISPROVEN.

Evidence over assertions. "It works" is not evidence. Screenshots, logs, network traces, and API responses are evidence. Every scenario must have at least 1 screenshot OR 1 log snippet — zero evidence is a validation failure.
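The minimum-evidence rule could be checked with something like the sketch below. The field names `id`, `screenshots`, and `logs` are hypothetical; the real structure is defined by references/data-schema.json and may differ.

```python
from typing import Any, Dict, List


def scenarios_without_evidence(scenarios: List[Dict[str, Any]]) -> List[str]:
    """Return ids of scenarios with neither a screenshot nor a log snippet."""
    missing = []
    for scenario in scenarios:
        # Hypothetical fields; empty or absent lists both count as no evidence.
        has_screenshot = bool(scenario.get("screenshots"))
        has_log = bool(scenario.get("logs"))
        if not (has_screenshot or has_log):
            missing.append(scenario["id"])
    return missing
```

A non-empty result would correspond to the validation failure described above.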

Findings are not failures of the process. Surviving failures after the fix loop become documented findings in the report. They are prominently displayed, not hidden.

Reference Files

| File | Purpose |
| --- | --- |
| `references/drivers.md` | Driver capabilities, commands, and log capture instructions |
| `references/confidence-levels.md` | PROVEN / PARTIAL / DISPROVEN / UNVERIFIABLE rubric with examples |
| `references/data-schema.json` | JSON Schema for data.json validation |
| `prompts/planner.md` | Planner subagent prompt template (Phase 2) |
| `prompts/plan-reviewer.md` | Adversarial reviewer prompt template (Phase 3) |
| `prompts/executor.md` | Executor subagent prompt template (Phase 4) |
| `prompts/fixer.md` | Fixer subagent prompt template (Phase 5) |
| `prompts/reporter.md` | Reporter subagent prompt template (Phase 6) |
| `prompts/report-reviewer.md` | Report reviewer prompt template (Phase 7) |
| `scripts/litmus.py` | Orchestration script: init, validate, serve, convergence, etc. |
