From litmus
Use when implementation is complete and you need to prove a feature or bugfix works with visual evidence. Triggers on "prove it works", "QA this", "verify with screenshots", "litmus", "show me proof", "evidence that it works", post-implementation verification requests, or any request to visually verify that code changes actually work in the real world. Also use when the user wants to generate a QA proof report with screenshots and logs.
Slash command: /litmus:qa
```
assets/report-template/index.html
docs/2026-03-26-litmus-design.md
docs/2026-03-26-litmus-implementation-plan.md
evals/evals.json
evals/sample-data.json
prompts/executor.md
prompts/fixer.md
prompts/plan-reviewer.md
prompts/planner.md
prompts/report-reviewer.md
prompts/reporter.md
references/confidence-levels.md
references/data-schema.json
references/drivers.md
scripts/litmus.py
scripts/test_litmus.py
```

The goal is not to test. The goal is to prove.
Every scenario must produce evidence that would convince a skeptical stranger
that the implementation fully satisfies its spec and works in the real world.
"It passes" is not proof. Screenshots, logs, and behavioral evidence are proof.
/litmus
/litmus --env staging --url https://staging.app.com
/litmus --driver curl --url http://localhost:8080
/litmus --plan docs/qa-plan.md --spec docs/feature-spec.md
/litmus --parallelism 2
Parameters:
| Parameter | Default | Options |
|---|---|---|
| --env | local | local, staging, prod |
| --url | auto-detect | http://localhost:3000, etc. |
| --driver | agent-browser | agent-browser, agent-browser-headed, chrome-extension, playwright, playwright-headed, curl |
| --plan | auto-detect | path to QA plan file |
| --spec | auto-detect | path to feature/bugfix spec |
| --parallelism | 3 | number of parallel subagents per fan-out |
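
The parameter table can be read as a small argument parser. The sketch below is illustrative only: `build_parser` is a hypothetical helper, not code from scripts/litmus.py, and it assumes that an omitted `--url`, `--plan`, or `--spec` means "auto-detect".

```python
import argparse

DRIVERS = [
    "agent-browser", "agent-browser-headed", "chrome-extension",
    "playwright", "playwright-headed", "curl",
]

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the parameter table above.
    parser = argparse.ArgumentParser(prog="litmus")
    parser.add_argument("--env", default="local", choices=["local", "staging", "prod"])
    parser.add_argument("--url", default=None)    # None stands for auto-detect
    parser.add_argument("--driver", default="agent-browser", choices=DRIVERS)
    parser.add_argument("--plan", default=None)   # None stands for auto-detect
    parser.add_argument("--spec", default=None)   # None stands for auto-detect
    parser.add_argument("--parallelism", type=int, default=3)
    return parser

args = build_parser().parse_args(["--env", "staging", "--url", "https://staging.app.com"])
print(args.env, args.driver, args.parallelism)  # staging agent-browser 3
```
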
Answer: "What are we proving, in what environment, with what driver?"
git diff HEAD~1 --stat
git diff HEAD~1
If no recent commits, check staged changes: git diff --cached
- --plan / --spec provided? Use them.
- Otherwise, search docs/, .litmus/, and the project root for plan/spec files.

Check in order:
1. package.json → look for scripts.dev to find the dev server command
2. docker-compose.yml → check exposed ports
3. .env / .env.local → look for PORT, APP_URL, BASE_URL
4. Procfile → check web process
5. curl -s http://localhost:3000, then ports 5173, 8080, 4000

If ambiguous after checks, ask the user before proceeding.
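
Part of the check order above can be sketched in Python. This covers only the package.json and .env steps. The docker-compose.yml, Procfile, and port-probing steps are left out. The function names are hypothetical, and preferring APP_URL and BASE_URL over PORT is an assumption, since the source does not rank them.

```python
import json
import re
from pathlib import Path
from typing import Optional

ENV_KEYS = ("APP_URL", "BASE_URL", "PORT")

def url_from_env_text(text: str) -> Optional[str]:
    """Return a URL from .env-style text, or None if no known key is set."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z_]+)\s*=\s*(\S+)", line)
        if match and match.group(1) in ENV_KEYS:
            values[match.group(1)] = match.group(2).strip("\"'")
    # Assumed precedence: an explicit URL wins over a bare port.
    for key in ("APP_URL", "BASE_URL"):
        if key in values:
            return values[key]
    if "PORT" in values:
        return f"http://localhost:{values['PORT']}"
    return None

def detect_hints(project_dir: str) -> dict:
    """Collect hints in the documented order; the caller decides or asks the user."""
    root = Path(project_dir)
    hints = {}
    pkg = root / "package.json"
    if pkg.exists():
        scripts = json.loads(pkg.read_text()).get("scripts", {})
        if "dev" in scripts:
            hints["dev_command"] = scripts["dev"]
    for name in (".env", ".env.local"):
        env_file = root / name
        if env_file.exists() and "url" not in hints:
            url = url_from_env_text(env_file.read_text())
            if url:
                hints["url"] = url
    return hints
```

If `detect_hints` returns no `url`, the remaining checks (or a question to the user) still apply.
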
- local env → agent-browser (default)
- staging/prod env → agent-browser, but recommend chrome-extension if auth walls are detected
- curl
- See references/drivers.md for driver capabilities and limitations.

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py init \
--project-dir {project_dir} \
--env {env} \
--url {url} \
--driver {driver} \
--parallelism {parallelism} \
--description "{short-description}"
Save the session_dir from the JSON output — every subsequent command uses it.
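
A minimal sketch of capturing that value, assuming the init command prints a single JSON object with a top-level `session_dir` key. The helper name and the sample payload are made up for illustration.

```python
import json

def session_dir_from_init(stdout: str) -> str:
    """Extract session_dir from the JSON text printed by the init command."""
    payload = json.loads(stdout)
    session_dir = payload.get("session_dir")
    if not session_dir:
        raise ValueError("init output did not contain session_dir")
    return session_dir

# Example payload; the path is hypothetical.
sample = '{"session_dir": ".litmus/sessions/example"}'
print(session_dir_from_init(sample))  # .litmus/sessions/example
```
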
Dispatch a FRESH subagent (new context window) for each planner.
Fan-out: dispatch --parallelism planner subagents in parallel (default: 3).
Each subagent:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/planner.md \
--session-dir {session_dir}
Plans are written to {session_dir}/planner-{n}-plan.md.

Orchestrator merges results:
Save merged plan:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py plan save \
--session-dir {session_dir} < merged-plan.md
Plan is saved to {session_dir}/proof-plan.md.
Every round: dispatch FRESH reviewer subagents (new context window, minimum 2).
Reviewers have no memory of previous rounds — fresh eyes each time.
Each round:
Dispatch --parallelism reviewer subagents in parallel (minimum 2). Each assembles its prompt:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/plan-reviewer.md \
--session-dir {session_dir}
Each reviewer reads proof-plan.md and challenges it:
Orchestrator merges findings:
Record the round:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py plan review \
--session-dir {session_dir} \
--round {N} < merged-findings.md
If findings exist:
Revise the plan and re-save it with plan save.

Convergence: fresh reviewers find nothing new → proceed to Phase 4.
Escalation: after 5 rounds without convergence, surface to the user with current plan and outstanding concerns. Ask whether to proceed or abort.
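
The review loop's control flow can be sketched as follows. `run_round` and `revise_plan` are hypothetical callables standing in for "dispatch fresh reviewers and merge findings" and "revise and re-save the plan". Only the convergence rule and the 5-round limit come from the text above.

```python
from typing import Callable, List, Tuple

MAX_PLAN_REVIEW_ROUNDS = 5  # escalation threshold stated above

def review_until_converged(
    run_round: Callable[[int], List[str]],
    revise_plan: Callable[[List[str]], None],
    max_rounds: int = MAX_PLAN_REVIEW_ROUNDS,
) -> Tuple[str, int, List[str]]:
    """Return (status, rounds used, outstanding findings)."""
    findings: List[str] = []
    for round_no in range(1, max_rounds + 1):
        findings = run_round(round_no)  # fresh reviewers every round
        if not findings:
            return ("converged", round_no, [])
        revise_plan(findings)
    # No convergence: hand the outstanding concerns to the user.
    return ("escalate", max_rounds, findings)

# Two rounds with findings, then a clean round.
queue = [["missing edge case"], ["no log evidence"], []]
print(review_until_converged(lambda n: queue[n - 1], lambda f: None))
# ('converged', 3, [])
```
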
Dispatch a FRESH executor subagent (new context window) per scenario group.
Grouping strategy:
Run up to --parallelism executors in parallel.

Each executor:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/executor.md \
--session-dir {session_dir}
See references/confidence-levels.md for the confidence rubric.

python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py evidence save \
--session-dir {session_dir} \
--scenario {scenario-id} \
--step {n} \
--description "{description}" \
--file {path/to/file}
Results go to {session_dir}/evidence/{scenario-id}/result.json.

Corroborating evidence principle: A screenshot proves the UI looked right. A screenshot plus server logs plus network logs proves the feature works end-to-end. Always ask: "What additional logs would strengthen this proof?"
After all executors complete: collect all result.json files. Any scenario with confidence below PROVEN enters the fix loop.
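
A sketch of that collection step, assuming each result.json carries its level in a top-level `confidence` field. That field name is a guess, so check references/data-schema.json. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List

def load_results(session_dir: str) -> Dict[str, dict]:
    """Map scenario id to the parsed result.json under {session_dir}/evidence/."""
    results = {}
    for path in sorted(Path(session_dir).glob("evidence/*/result.json")):
        results[path.parent.name] = json.loads(path.read_text())
    return results

def needs_fix(results: Dict[str, dict]) -> List[str]:
    """Scenario ids whose confidence is anything other than PROVEN."""
    # "confidence" is an assumed field name, not confirmed by the source.
    return [sid for sid, r in results.items() if r.get("confidence") != "PROVEN"]
```

A scenario with no `confidence` value is treated as needing a fix, which errs on the side of the burden of proof.
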
Run this phase only if Phase 4 produced scenarios below PROVEN.
Each round:
Orchestrator categorizes each failing scenario:
Dispatch FRESH fixer subagents (new context window).
Each fixer:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/fixer.md \
--session-dir {session_dir}
Fixes are recorded in {session_dir}/fix-history/round-{N}/fixes.diff.

Dispatch FRESH executor subagents (new context window) for previously failing scenarios only.
Re-executors do not know what was fixed — they just collect evidence fresh. Same prompt assembly as Phase 4.
Each re-executor writes updated result.json files.
Write the current failure set:
# Write failures.json listing scenario IDs that are still failing
echo '{"failures": ["scenario-id-1", "scenario-id-2"]}' \
> {session_dir}/fix-history/round-{N}/failures.json
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py convergence check \
--session-dir {session_dir}
Exit conditions (in priority order):
Surviving failures become documented findings — not swept under the rug. Write them to {session_dir}/findings.md.
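
As an alternative to the echo one-liner, failures.json can be written from Python, which avoids shell quoting problems when scenario IDs are generated. The path layout and the `failures` key follow the command above. The helper name is hypothetical, and sorting and de-duplicating the IDs is a choice made here, not something the source requires.

```python
import json
from pathlib import Path
from typing import Iterable

def write_failures(session_dir: str, round_no: int, failing_ids: Iterable[str]) -> Path:
    """Write failures.json for one fix round and return its path."""
    round_dir = Path(session_dir) / "fix-history" / f"round-{round_no}"
    round_dir.mkdir(parents=True, exist_ok=True)
    path = round_dir / "failures.json"
    # Sorted, de-duplicated ids keep the file stable between rounds.
    payload = {"failures": sorted(set(failing_ids))}
    path.write_text(json.dumps(payload, indent=2))
    return path
```
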
Dispatch a SINGLE reporter subagent (new context window). Never fan-out this phase.
The reporter:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/reporter.md \
--session-dir {session_dir}
Reads the session artifacts in {session_dir}.
Builds data.json per the schema in references/data-schema.json.
Writes it to {session_dir}/data.json.

Validate:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py validate \
--data {session_dir}/data.json \
--schema ${CLAUDE_PLUGIN_ROOT}/skills/qa/references/data-schema.json
If validation fails, re-run the reporter to fix data.json before proceeding.
Assemble report:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py report assemble \
--session-dir {session_dir}
Serve report:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py serve \
--session-dir {session_dir}
Note the URL from the JSON output for the report reviewers in Phase 7.
Every round: dispatch FRESH reviewer subagents (new context window, minimum 2).
Each round:
Dispatch --parallelism report reviewer subagents in parallel (minimum 2). Each assembles its prompt:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py prompt assemble \
--template ${CLAUDE_PLUGIN_ROOT}/skills/qa/prompts/report-reviewer.md \
--session-dir {session_dir} \
--var REPORT_URL={report_url}
Orchestrator merges findings (union, deduplicate).
If findings exist:
Address them in data.json.

Convergence: no findings from fresh reviewers → report is final.
After 3 rounds without convergence, surface outstanding concerns to the user and present the report as-is.
Stop the server when done:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/qa/scripts/litmus.py stop \
--session-dir {session_dir}
Present the final report URL and overall verdict to the user.
Fresh subagents every round. Every subagent dispatch opens a new context window. No accumulated bias, no memory of previous attempts. Disk is the only shared state.
Disk is shared state. All subagents read from and write to {session_dir}. The orchestrator coordinates by reading disk artifacts after each fan-out completes.
Burden of proof. PROVEN means a skeptical stranger would agree. When in doubt, assign PARTIAL — not PROVEN. See references/confidence-levels.md.
Weakest-link verdict. The overall verdict equals the lowest confidence level across all scenarios. One DISPROVEN scenario means the whole report is DISPROVEN.
Evidence over assertions. "It works" is not evidence. Screenshots, logs, network traces, and API responses are evidence. Every scenario must have at least 1 screenshot OR 1 log snippet — zero evidence is a validation failure.
Findings are not failures of the process. Surviving failures after the fix loop become documented findings in the report. They are prominently displayed, not hidden.
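
Two of these rules are mechanical enough to sketch: the weakest-link verdict and the minimum-evidence check. The ordering of levels below is an assumption. The text only establishes that DISPROVEN sinks the whole report, so the position of UNVERIFIABLE is a guess. The field names `id`, `screenshots`, and `logs` are hypothetical.

```python
from typing import Dict, List

# Assumed ordering, strongest first; see references/confidence-levels.md.
LEVELS = ["PROVEN", "PARTIAL", "UNVERIFIABLE", "DISPROVEN"]

def overall_verdict(confidences: List[str]) -> str:
    """Weakest link: the overall verdict is the lowest level present."""
    if not confidences:
        raise ValueError("no scenarios to judge")
    return max(confidences, key=LEVELS.index)

def scenarios_without_evidence(scenarios: List[Dict]) -> List[str]:
    """Ids of scenarios with neither a screenshot nor a log snippet."""
    # "id", "screenshots" and "logs" are hypothetical field names.
    return [
        s["id"] for s in scenarios
        if not s.get("screenshots") and not s.get("logs")
    ]

print(overall_verdict(["PROVEN", "PARTIAL", "PROVEN"]))  # PARTIAL
```
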
| File | Purpose |
|---|---|
| references/drivers.md | Driver capabilities, commands, and log capture instructions |
| references/confidence-levels.md | PROVEN / PARTIAL / DISPROVEN / UNVERIFIABLE rubric with examples |
| references/data-schema.json | JSON Schema for data.json validation |
| prompts/planner.md | Planner subagent prompt template (Phase 2) |
| prompts/plan-reviewer.md | Adversarial reviewer prompt template (Phase 3) |
| prompts/executor.md | Executor subagent prompt template (Phase 4) |
| prompts/fixer.md | Fixer subagent prompt template (Phase 5) |
| prompts/reporter.md | Reporter subagent prompt template (Phase 6) |
| prompts/report-reviewer.md | Report reviewer prompt template (Phase 7) |
| scripts/litmus.py | Orchestration script: init, validate, serve, convergence, etc. |
npx claudepluginhub foundra-build/foundra-ai-tools --plugin litmus