From session-orchestrator
Orchestrates agentic end-to-end tests by resolving target and profile, dispatching the right driver (Playwright for web, peekaboo for macOS), running a UX evaluator, reconciling findings with the issue tracker, and writing reports.
Slash command: `/session-orchestrator:test-runner` (model: sonnet)
> Project-instruction file resolution: `CLAUDE.md` and `AGENTS.md` (Codex CLI) are transparent aliases — see [skills/_shared/instruction-file-resolution.md](../_shared/instruction-file-resolution.md). Wherever this skill mentions `CLAUDE.md`, the alias rule applies.
Before anything else, read and internalize soul.md in this skill directory. It defines WHO you are — your role as an orchestrator, your delegation boundaries, and your non-negotiable constraints.
Read skills/_shared/bootstrap-gate.md and execute the gate check. If GATE_CLOSED, invoke skills/bootstrap/SKILL.md and wait for completion. If GATE_OPEN, continue to Phase 1.
Read and parse Session Config per skills/_shared/config-reading.md. Store result as $CONFIG.
Test-runner specific fields (parse these specifically):
- `test-runner.default-profile` (default: `smoke`)
- `test-runner.retention-days` (default: `30`)
- `test-command`, `typecheck-command`, `lint-command` (used for context only, not driven here)

Resolution order (first match wins):

1. `--target <name> --profile <name>` (explicit, highest priority)
2. `.orchestrator/policy/test-profiles.json` by target name (if present)
3. `playwright.config.{ts,js}` present → target type `web`, dispatch `playwright-driver`
4. `Package.swift` present → target type `mac`, dispatch `peekaboo-driver` (see `skills/peekaboo-driver/SKILL.md`)
5. Error: `Cannot resolve target — provide --target or add .orchestrator/policy/test-profiles.json`
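The first-match-wins order above can be sketched as a small pure function. This is an illustration only, not the skill's actual resolver: `resolveTarget`, its argument names, and the returned object shape are all hypothetical, and `policyEntry` stands in for whatever a lookup in `test-profiles.json` yields (that file's schema is not shown here).

```javascript
// Sketch of the first-match-wins resolution order.
// Hypothetical inputs: `policyEntry` stands for whatever the lookup in
// .orchestrator/policy/test-profiles.json returned (its schema is not shown).
function resolveTarget({ target, profile, policyEntry, files = [] } = {}) {
  // 1. Explicit flags win over everything else.
  if (target && profile) {
    return { source: 'explicit', target, profile };
  }
  // 2. A matching policy entry, if the file exists and names the target.
  if (policyEntry) {
    return { source: 'policy', ...policyEntry };
  }
  const present = new Set(files);
  // 3. A Playwright config marks a web target.
  if (present.has('playwright.config.ts') || present.has('playwright.config.js')) {
    return { source: 'auto', type: 'web', driver: 'playwright-driver' };
  }
  // 4. A Swift package manifest marks a macOS target.
  if (present.has('Package.swift')) {
    return { source: 'auto', type: 'mac', driver: 'peekaboo-driver' };
  }
  // 5. Nothing matched (message abbreviated from the one quoted above).
  throw new Error('Cannot resolve target');
}
```

Because each rule returns as soon as it matches, a repository containing both a Playwright config and a `Package.swift` resolves to the web target in this sketch.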
Generate a run ID immediately after target resolution:
import { makeRunId } from 'scripts/lib/test-runner/artifact-paths.mjs';
const runId = makeRunId(); // e.g. "your-target-app-1715688000123"
All artifact paths in subsequent phases derive from this run ID. Never use ad-hoc paths.
After resolution, emit: Test Runner: target=[name] profile=[name] run_id=[runId] driver=[driver]
Determine ${RUN_DIR} from artifact-paths.mjs:runDirPath(runId) before dispatching any driver. All drivers write artifacts under ${RUN_DIR}/.
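To make the run ID and status line concrete, here is a minimal sketch. `makeRunIdSketch` is a hypothetical stand-in that only mirrors the shape of the example ID (`<target>-<epoch ms>`); the real `makeRunId()` and `runDirPath()` in `artifact-paths.mjs` are not shown in this document and may work differently.

```javascript
// Hypothetical stand-in for makeRunId(): "<target>-<epoch ms>", matching the
// shape of the example ID above. The real helper's inputs are not shown here.
function makeRunIdSketch(target, now = Date.now()) {
  return `${target}-${now}`;
}

// Formats the status line emitted after target resolution.
function formatBanner({ target, profile, runId, driver }) {
  return `Test Runner: target=${target} profile=${profile} run_id=${runId} driver=${driver}`;
}

const runId = makeRunIdSketch('your-target-app', 1715688000123);
console.log(formatBanner({
  target: 'your-target-app',
  profile: 'smoke',
  runId,
  driver: 'playwright-driver',
}));
```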
When `since_ref` is set (passed from the `/test --since <git-ref>` handoff contract):

1. Call `changedFilesSince(since_ref)` from `scripts/lib/discovery/helpers.mjs`.
2. If the result is `[]` (no files changed since the ref), emit `No files changed since <since_ref>. Skipping test run.` and exit with status 0. Do NOT fall back to a full-repo test run.
3. Otherwise, set `TEST_CHANGED_FILES` in the driver subprocess environment (see driver invocations below). Driver-side filtering is deferred: drivers receive the env var but do not yet filter by it in this wave.

For each resolved driver:
Dispatch via Bash per skills/playwright-driver/SKILL.md. Pass ${RUN_DIR} so the driver writes all artifacts (screenshots, AX dumps, HAR) under it.
# Example invocation shape (exact flags defined by playwright-driver SKILL.md)
TEST_CHANGED_FILES="${CHANGED_FILES_JSON}" node scripts/lib/playwright-driver/runner.mjs \
--run-dir "${RUN_DIR}" \
--profile "${PROFILE}" \
--target "${TARGET}"
Where ${CHANGED_FILES_JSON} is JSON.stringify(changedFiles) when --since was provided, or an empty string otherwise.
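The `since_ref` rules and the `TEST_CHANGED_FILES` value can be combined into one decision, sketched below. `planChangedFileScope` and its return shape are hypothetical names for illustration; only the skip message, the exit status, and the `JSON.stringify` encoding come from the text above.

```javascript
// Decides whether to skip the run and what TEST_CHANGED_FILES should contain.
// `changedFiles` is assumed to be the array returned by changedFilesSince().
function planChangedFileScope(sinceRef, changedFiles = []) {
  // No --since given: run unscoped, with an empty env value.
  if (!sinceRef) {
    return { skip: false, env: { TEST_CHANGED_FILES: '' } };
  }
  // --since given but nothing changed: skip with status 0, never run full-repo.
  if (changedFiles.length === 0) {
    return {
      skip: true,
      exitStatus: 0,
      message: `No files changed since ${sinceRef}. Skipping test run.`,
    };
  }
  // --since given and files changed: forward the list as JSON.
  return { skip: false, env: { TEST_CHANGED_FILES: JSON.stringify(changedFiles) } };
}
```

The caller would merge `env` into the driver subprocess environment, or print `message` and exit when `skip` is true.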
Capture exit code. A non-zero exit from Playwright means test failures — these become findings for the UX evaluator. They are NOT a fatal error for the orchestrator. Continue to Phase 3 regardless of exit code.
Log: playwright-driver exited [code] — [N] test files captured under ${RUN_DIR}
See `skills/peekaboo-driver/SKILL.md` for the full dispatch contract, permission probe, and artifact layout.
Pre-dispatch platform check: The driver's Phase 1 gate handles the platform and version checks (darwin + macOS 15.0+) and exits 0 (non-fatal skip) on incompatible systems. The orchestrator does not need to replicate these checks.
Permission probe: The driver runs its own Phase 2 permission probe via peekaboo permissions status --json. If required permissions (Screen Recording, Accessibility) are not granted, the driver surfaces an AUQ and exits 2 on failure. The orchestrator treats exit 2 as a driver-framework error, not a test failure.
Invocation:
# All inputs via environment variables — no positional arguments
RUN_DIR="${RUN_DIR}" TARGET="${TARGET}" PROFILE="${PROFILE}" bash skills/peekaboo-driver/SKILL.md
Outputs the orchestrator must parse:
| Artifact | Description |
|---|---|
| `${RUN_DIR}/exit_code` | Plain integer file written by driver before exit |
| `${RUN_DIR}/results.json` | Driver summary: `exit_code`, `scenarios_attempted`, `scenarios_passed`, `scenarios_failed` |
| `${RUN_DIR}/ax-snapshots/<scenario>.json` | peekaboo AX-tree output per scenario |
| `${RUN_DIR}/ax-snapshots/glass-modifiers-<ts>.json` | Liquid Glass conformance artifact (consumed by ux-evaluator Check 4) |
| `${RUN_DIR}/screenshots/<step>-<ts>.png` | Per-step screenshots (evidence for ux-evaluator findings) |
| `${RUN_DIR}/console.ndjson` | Driver log events as NDJSON |
Exit-code semantics:
| Code | Meaning | Orchestrator Action |
|---|---|---|
| 0 | All captures succeeded (or platform skip) | Record pass or skip, continue to Phase 3 |
| 1 | At least one capture failed | Failures become findings (non-fatal) — continue to Phase 3 |
| 2 | Framework error (missing binary, permission denied, OS mismatch) | Surface as driver error; still continue to Phase 3 with available artifacts |
Capture exit code. Exit 1 (capture failures) produces findings for the UX evaluator — it is NOT fatal for the orchestrator. Exit 2 (framework error) is surfaced in the report but does not halt Phase 3. Continue to Phase 3 regardless of exit code.
Log: peekaboo-driver exited [code] — [N] scenarios captured under ${RUN_DIR}
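The exit-code table translates into a small lookup. The sketch below uses hypothetical names (`interpretDriverExit`, the `outcome` labels); treating codes outside the table as driver errors is an assumption, since the contract above only defines 0, 1, and 2.

```javascript
// Outcomes for the documented peekaboo-driver exit codes.
const PEEKABOO_EXIT_OUTCOMES = {
  0: 'pass-or-skip',   // all captures succeeded, or platform skip
  1: 'findings',       // at least one capture failed (non-fatal)
  2: 'driver-error',   // framework error, surfaced in the report
};

function interpretDriverExit(code) {
  // Assumption: codes not listed in the table are handled like framework errors.
  const outcome = PEEKABOO_EXIT_OUTCOMES[code] ?? 'driver-error';
  // Phase 3 runs regardless of the driver's exit code.
  return { code, outcome, continueToPhase3: true };
}
```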
Invoke the ux-evaluator agent (agents/ux-evaluator.md) via the Agent tool. The agent reads driver artifacts under ${RUN_DIR}/ and applies skills/test-runner/rubric-v1.md (4 checks). The agent writes findings.jsonl directly to ${RUN_DIR}/findings.jsonl — the coordinator does NOT need to forward findings through prompt context.
Agent({
description: `UX evaluate run ${runId}`,
prompt: `<scope: ${RUN_DIR}, rubric: skills/test-runner/rubric-v1.md, output: ${RUN_DIR}/findings.jsonl>`,
subagent_type: "ux-evaluator",
run_in_background: false
})
run_in_background: false is mandatory — Phase 4 depends on findings.jsonl being fully written before reconciliation begins.
After the agent completes, verify ${RUN_DIR}/findings.jsonl exists. If missing, emit a warning and skip Phase 4 (no findings to reconcile).
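Once the file is confirmed to exist, its contents need to be parsed line by line. A minimal sketch of that step follows; `parseFindingsJsonl` is a hypothetical helper, and collecting malformed lines instead of aborting is an assumption (the document does not say how bad lines are handled).

```javascript
// Parses JSONL text into findings, one JSON object per non-empty line.
// Malformed lines are reported with their 1-based line number, not thrown.
function parseFindingsJsonl(text) {
  const findings = [];
  const errors = [];
  text.split('\n').forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === '') return; // tolerate blank lines and a trailing newline
    try {
      findings.push(JSON.parse(trimmed));
    } catch (err) {
      errors.push({ line: index + 1, message: err.message });
    }
  });
  return { findings, errors };
}
```

The caller would read `${RUN_DIR}/findings.jsonl` as UTF-8 text and pass it in; an empty file yields an empty `findings` array.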
Read ${RUN_DIR}/findings.jsonl. Use the helpers in scripts/lib/test-runner/issue-reconcile.mjs for all glab/gh interactions — never call glab or gh directly.
| Function | Purpose |
|---|---|
| `listExistingFindings({glabPath, project, label, maxBuffer})` | Query the tracker for all open `from:test-runner` issues; returns `{ok, issues[], fingerprints: Set}` |
| `createFinding({glabPath, project, fingerprint, title, body, labels, dryRun, maxBuffer})` | Create a new issue; returns `{ok, action: 'create', iid?, command?}` |
| `updateFinding({glabPath, project, iid, comment, dryRun, maxBuffer})` | Add a comment to an existing issue; returns `{ok, action: 'comment', command?}` |
| `triageDecision(finding, candidates)` | Pure decision: fingerprint-exact → ignore; Levenshtein ≤ 2 on title → update; else → create |
| `reconcileFinding({finding, existingFingerprints, glabPath, dryRun})` | Track-A legacy helper: single-finding create-or-noop using a pre-built fingerprint Set |
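The decision rule listed for `triageDecision` can be illustrated with a standalone sketch. This is not the helper's actual source: `triageSketch`, the `{fingerprint, title}` finding shape, and the `{iid, fingerprint, title}` candidate shape are assumptions made for the example.

```javascript
// Classic dynamic-programming edit distance, keeping one previous row.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical mirror of the rule: exact fingerprint → ignore,
// title within edit distance 2 → update, otherwise → create.
function triageSketch(finding, candidates) {
  if (candidates.some((c) => c.fingerprint === finding.fingerprint)) {
    return { action: 'ignore' };
  }
  const near = candidates.find((c) => levenshtein(c.title, finding.title) <= 2);
  if (near) {
    return { action: 'update', target: near.iid };
  }
  return { action: 'create' };
}
```

The fingerprint check runs first, so a finding that is already tracked is never re-filed even if its title has drifted.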
Security notes (ADR-364 §C5, #388, #389):

- Subprocess calls run with `maxBuffer: 4 MB` and `shell: false`.
- Oversized issue bodies are rejected with `{ok: false, error: {code: 'BODY_TOO_LARGE', ...}}`.
- `**Fingerprint:**` literals in free-text fields are replaced with `__Fingerprint__` before the authoritative sentinel line is appended (sentinel-injection hardening, #388).

Before iterating findings, build the dedup set once:
import { listExistingFindings } from 'scripts/lib/test-runner/issue-reconcile.mjs';
const listResult = await listExistingFindings({ glabPath });
// listResult.ok === false → log and continue with empty Set (conservative)
const existingFingerprints = listResult.ok ? listResult.fingerprints : new Set();
const existingIssues = listResult.ok ? listResult.issues : [];
If the glab query fails, log the error and proceed with an empty fingerprint set (conservative: may create duplicate issues, safer than silently skipping).
| Severity | Action |
|---|---|
| `critical` | Auto-create issue, no AUQ. Label: `from:test-runner,priority:critical` |
| `high` | Auto-create issue, no AUQ. Label: `from:test-runner,priority:high` |
| `medium` | Batched AUQ triage (see below) |
| `low` | Batched AUQ triage (see below) |
For critical and high, call triageDecision(finding, existingIssues) first. If action === 'ignore', skip. If action === 'update', call updateFinding(). If action === 'create', call createFinding().
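The severity routing can be sketched as a partition step that runs before any tracker call. `partitionBySeverity` and `labelsFor` are hypothetical names, the `severity` field on a finding is assumed from the table, and collecting unrecognised severities separately is an assumption.

```javascript
// Splits findings into auto-filed (critical/high) and batched (medium/low).
function partitionBySeverity(findings) {
  const auto = [];
  const batched = [];
  const unknown = []; // assumption: unrecognised severities are set aside
  for (const finding of findings) {
    if (finding.severity === 'critical' || finding.severity === 'high') {
      auto.push(finding);
    } else if (finding.severity === 'medium' || finding.severity === 'low') {
      batched.push(finding);
    } else {
      unknown.push(finding);
    }
  }
  return { auto, batched, unknown };
}

// Label string for auto-created issues, as listed in the severity table.
function labelsFor(severity) {
  return `from:test-runner,priority:${severity}`;
}
```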
Group medium and low findings and present via a single AskUserQuestion call (mirror the discovery skill's Phase 5 Step 3 pattern):
AskUserQuestion({
questions: [{
question: `<N> medium/low findings to triage. How to handle?`,
header: "Test-runner triage",
options: [
{
label: "Create all (Recommended)",
description: "File <N> new issues, all with label from:test-runner"
},
{
label: "Review each",
description: "Walk through findings individually with create/update/ignore"
},
{
label: "Skip all",
description: "Dismiss medium/low; only critical/high get auto-filed"
}
],
multiSelect: false
}]
})
"Create all": For each medium/low finding call triageDecision(finding, existingIssues):
- `ignore` → skip silently.
- `update` → call `updateFinding({iid: target, comment, ...})`.
- `create` → call `createFinding({fingerprint, title, body, labels, ...})`.

"Review each": Walk through findings one by one. For each, call `triageDecision` to get the recommended action, then present a per-finding AUQ with three options (Create / Update existing #<iid> / Skip). Honour the user's selection.
"Skip all": Dismiss all medium/low findings without filing. Log skipped: N medium/low findings (user dismissed).
Collect outcomes from reconciliation calls into three buckets:
- `created[]`: issue IIDs of newly created issues
- `commented[]`: issue IIDs of existing issues that received a new comment
- `noop[]`: fingerprints/IIDs where no action was taken (dedup hit or user dismissed)

Pass these buckets to Phase 5.
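A sketch of the bucketing step is shown below. It assumes the caller records one result per finding with the hypothetical shape `{action, ok, iid, fingerprint}`, built from the helper return values plus the IID the caller already knows. Reporting failed calls in a separate `failed` list is an assumption; the document does not say where they go.

```javascript
// Sorts per-finding reconciliation results into the three report buckets.
function collectOutcomes(results) {
  const buckets = { created: [], commented: [], noop: [] };
  const failed = []; // assumption: failed tracker calls are tracked separately
  for (const r of results) {
    if (r.action === 'create' || r.action === 'comment') {
      if (!r.ok) {
        failed.push(r);
      } else if (r.action === 'create') {
        buckets.created.push(r.iid);
      } else {
        buckets.commented.push(r.iid);
      }
    } else {
      // Dedup hit or user dismissal: keep the IID if known, else the fingerprint.
      buckets.noop.push(r.iid ?? r.fingerprint);
    }
  }
  return { buckets, failed };
}
```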
Write ${RUN_DIR}/report.md — a markdown summary of the run. Minimum content:
# Test Run Report
**Run ID:** <runId>
**Target:** <target>
**Profile:** <profile>
**Timestamp:** <ISO 8601>
**Duration:** <ms>ms
## Findings Summary
| Severity | Count |
|----------|-------|
| critical | N |
| high | N |
| medium | N |
| low | N |
## Check Breakdown
| Check | Findings |
|-------|---------|
| <check_name> | N |
## Issue Actions
- **Created:** <IIDs>
- **Commented:** <IIDs>
- **No-op (dedup):** <IIDs>
## Driver Exit Codes
| Driver | Exit Code |
|--------|-----------|
| playwright | N |
Append one record to .orchestrator/metrics/test-runs.jsonl (path from artifact-paths.mjs:jsonlRollupPath()) via appendJsonlAtomic from scripts/lib/autopilot/telemetry.mjs. Never use appendFileSync — POSIX-atomic guarantee is required (per #376).
Schema v1:
{
  "schema_version": 1,
  "run_id": "your-target-app-1715688000123",
  "timestamp": "2026-05-14T07:30:00Z",
  "target": "your-target-app",
  "profile": "smoke",
  "drivers": ["playwright"],
  "finding_counts": {"critical": 0, "high": 2, "medium": 5, "low": 3},
  "issues": {"created": [388, 389], "commented": [], "noop": [380]},
  "exit_code": 1,
  "duration_ms": 12345
}
exit_code is the orchestrator's own exit code: 0 if no critical/high findings auto-created, 1 otherwise.
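Assembling the roll-up record can be sketched as follows. `buildRollupRecord`, `countBySeverity`, and the `autoCreatedCount` input are hypothetical; the field names come from Schema v1 above. Note that `toISOString()` includes milliseconds, which differs slightly from the timestamp in the example record.

```javascript
// Counts findings per severity; unrecognised severities are ignored here.
function countBySeverity(findings) {
  const counts = { critical: 0, high: 0, medium: 0, low: 0 };
  for (const f of findings) {
    if (Object.prototype.hasOwnProperty.call(counts, f.severity)) {
      counts[f.severity] += 1;
    }
  }
  return counts;
}

// Builds one Schema v1 record. `autoCreatedCount` is the number of
// critical/high findings for which an issue was auto-created this run.
function buildRollupRecord({
  runId, target, profile, drivers, findings, issues,
  autoCreatedCount, durationMs, now = new Date(),
}) {
  return {
    schema_version: 1,
    run_id: runId,
    timestamp: now.toISOString(),
    target,
    profile,
    drivers,
    finding_counts: countBySeverity(findings),
    issues,
    exit_code: autoCreatedCount > 0 ? 1 : 0,
    duration_ms: durationMs,
  };
}
```

The record would then be serialised with `JSON.stringify` and handed to `appendJsonlAtomic`, whose signature is not shown in this document.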
Emit to stdout:
Test Runner complete: run_id=<runId> findings=[total] issues_created=[N] duration=[ms]ms
- Dispatch each driver per its own contract (`skills/playwright-driver/SKILL.md`, `skills/peekaboo-driver/SKILL.md`).
- Write all artifacts under `${RUN_DIR}`.
- Fingerprint dedup via `issue-reconcile.mjs` is mandatory every run.
- Never use `@playwright/mcp` for browser drive: R5 hard-gate enforced by `scripts/lib/test-runner/check-playwright-mcp-canary.mjs` (4× token cost per Microsoft benchmark).
- Never call `glab` or `gh` directly: always go through `scripts/lib/test-runner/issue-reconcile.mjs` (execFile + binary allowlist + arg-validation, per ADR-364 §C5).
- Never use `appendFileSync`: use `appendJsonlAtomic` from `scripts/lib/autopilot/telemetry.mjs` for the JSONL roll-up (POSIX-atomic guarantee, per #376).
- Never write artifacts outside `${RUN_DIR}`: it is the single source of truth for a run.
- Always generate the run ID up front (`artifact-paths.mjs:makeRunId()`).
- Always set `run_in_background: false` for the ux-evaluator agent dispatch: Phase 4 depends on its output.

| File | Purpose |
|---|---|
| `soul.md` | Orchestrator identity; read before Phase 0 |
| `rubric-v1.md` | UX rubric specification (4 checks, fingerprint formula, findings JSON schema); read by the ux-evaluator agent |
| File | Purpose |
|---|---|
| `scripts/lib/test-runner/fingerprint.mjs` | Stable 16-hex fingerprint per finding (deterministic, idempotent) |
| `scripts/lib/test-runner/artifact-paths.mjs` | Run-dir and artifact path builders (pure functions, no side effects) |
| `scripts/lib/test-runner/issue-reconcile.mjs` | glab/gh dispatch via execFile + binary allowlist (ADR-364 §C5) |