From qa-experimentation
Workflow-driven skill that builds an A/B test validity checklist from an experiment proposal. Walks through the canonical validity gates (pre-registration of OEC + power calc + guardrails, randomization unit + SRM check, assignment integrity, telemetry correctness, peeking discipline per peeking-problem-reference, novelty/primacy assessment, post-experiment SRM re-check, results-interpretation guardrails per Kohavi et al.) and emits a per-experiment checklist + a sign-off form. Use when launching a new experiment, auditing an existing one, or building experimentation governance. Composes guardrail-metrics-reference + peeking-problem-reference.
Slash command: /qa-experimentation:ab-test-validity-checklist
This skill produces the pre-flight + post-flight validity checklist for an A/B test. Each item is a gate; failing one without explicit acknowledgment invalidates the experiment.
Per Kohavi et al. Trustworthy Online Controlled Experiments (ISBN 978-1108724265): "More than 50% of experiments in practice are invalidated by issues the checklist catches."
The output: a per-experiment markdown checklist + a sign-off form for the experiment owner.
Document before launch:
| Item | What |
|---|---|
| OEC | The single metric (or weighted combination) to improve |
| Power | Expected effect size, sample size, alpha, beta |
| Guardrails | Per guardrail-metrics-reference - list each + threshold |
| Randomization unit | User / session / device / cookie / IP / tenant |
| Allocation | Percentages per arm; rules for ramp-up |
| Look schedule | Pre-declared days; per peeking-problem-reference |
| Sequential method | Fixed / Pocock / O'Brien-Fleming / always-valid |
| Stop-early rules | What signals stop (loss on OEC, blocking guardrail) |
Commit this to the repo as experiments/<id>/proposal.yml.
Any post-launch change requires explicit team approval.
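The Power row of the pre-registration table can be made concrete with a standard sample-size approximation. The sketch below assumes a conversion-rate OEC and a two-sided two-proportion z-test; the function name, the 10% baseline rate, and the 5% relative MDE are hypothetical, and alpha = 0.05 / beta = 0.20 mirror the defaults in the checklist template.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_mde, alpha=0.05, beta=0.20):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate (assumed known from history).
    relative_mde:  minimum detectable effect as a relative lift (0.05 = +5%).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power = 1 - beta
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Hypothetical inputs: 10% baseline conversion, +5% relative MDE.
n_per_arm = sample_size_per_arm(baseline_rate=0.10, relative_mde=0.05)
print(f"users per arm: {n_per_arm}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE belongs in the pre-registration rather than being chosen after the data arrive.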
Per Microsoft Experimentation Platform research (KDD 2019 paper "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments"): if the observed allocation (e.g., 50.3% A, 49.7% B) deviates significantly from intended (50% / 50%), the experiment is invalid until root cause is found. SRM signals:
Chi-square test:
χ² = Σ ((observed_i - expected_i)² / expected_i)
For 2 arms at 50/50 with N=1e6 users:
Threshold: p < 0.0001 is the SRM-detection boundary this checklist uses (the chi-square test is very sensitive at large N; a strict threshold limits false-positive SRM alarms).
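The chi-square check above can be worked through for the 50.3% / 49.7% split mentioned earlier. This is a minimal sketch for the two-arm case only, using the fact that a chi-square statistic with 1 degree of freedom has survival function erfc(sqrt(χ²/2)); the function and constant names are hypothetical and this is not the sample-ratio-mismatch-detector implementation.

```python
import math

SRM_P_THRESHOLD = 1e-4  # the p < 0.0001 boundary from the text


def srm_check(observed_a, observed_b, expected_share_a=0.5):
    """Two-arm chi-square goodness-of-fit test (1 degree of freedom)."""
    total = observed_a + observed_b
    expected = (total * expected_share_a, total * (1 - expected_share_a))
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs, exp in zip((observed_a, observed_b), expected)
    )
    # For 1 df, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < SRM_P_THRESHOLD


# N = 1e6 users, observed 50.3% / 49.7% against an intended 50/50 split.
chi2, p_value, srm_detected = srm_check(503_000, 497_000)
print(f"chi2={chi2:.1f} p={p_value:.2e} srm_detected={srm_detected}")
```

Here each arm is off by 3,000 users from the expected 500,000, so χ² = 2 × 3000² / 500000 = 36, far past the threshold. A split that looks nearly balanced in percentage terms is a clear SRM at this sample size.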
If SRM is detected: stop ship discussion; root-cause first.
Use sample-ratio-mismatch-detector.
Tests for the assignment SDK / service:
| Test | Pattern |
|---|---|
| Determinism | Same (user, experiment) → same arm across calls |
| Sticky assignment | User reassigned only if experiment reconfigured |
| Cross-experiment independence | Assignment to experiment A doesn't bias experiment B |
| Bot exclusion consistent | If bots filtered, filter applies before assignment |
| Latency | Assignment SDK adds < 5ms to request path |
These tests live in the SDK-specific test skills, per statsig-test, optimizely-test, etc.
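To show what the determinism and cross-experiment independence tests look like in practice, here is a sketch against a stand-in assignment function. The hash-based `assign_arm` below is an assumption for illustration only; real SDKs use their own bucketing schemes, and the same checks would be pointed at the SDK call instead.

```python
import hashlib


def assign_arm(user_id, experiment_id, treatment_share=0.5):
    """Hypothetical stand-in for an assignment SDK: salted hash bucketing."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


users = [f"user-{i}" for i in range(20_000)]

# Determinism: the same (user, experiment) pair maps to the same arm.
assert all(assign_arm(u, "exp-a") == assign_arm(u, "exp-a") for u in users)

# Cross-experiment independence: being in treatment for exp-a should not
# shift the chance of being in treatment for exp-b away from about 50%.
in_a_treatment = [u for u in users if assign_arm(u, "exp-a") == "treatment"]
share_b_given_a = sum(
    assign_arm(u, "exp-b") == "treatment" for u in in_a_treatment
) / len(in_a_treatment)
assert 0.45 < share_b_given_a < 0.55
```

Salting the hash with the experiment id is what keeps the two experiments decorrelated in this sketch; hashing the user id alone would put the same users in treatment everywhere.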
Verify the event firing matches the proposal:
Per peeking-problem-reference:
| Rule | Test |
|---|---|
| If sequential / always-valid: p-value valid at any look | Dashboard p-value uses the valid math |
| If fixed-horizon: no early-stop UI | "Ship" button disabled until N reached |
| If Pocock/OBF: look schedule pre-declared | Dashboards lock looks outside the schedule |
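The reason these rules exist can be illustrated with a small A/A simulation: with no true effect, checking a fixed-horizon z-test at every look and stopping on the first significant result inflates the false-positive rate well above the nominal 5%. The sketch below assumes equally sized batches and a known-variance z-statistic; names and parameters are hypothetical.

```python
import math
import random


def false_positive_rates(n_sims=2000, n_looks=5, z_crit=1.96, seed=0):
    """Simulate A/A tests; return (rate with peeking, rate at final look only)."""
    rng = random.Random(seed)
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each batch contributes one standardized increment under the null.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        any_look_hits += crossed_at_any_look
        final_look_hits += abs(z) > z_crit
    return any_look_hits / n_sims, final_look_hits / n_sims


peeking_rate, fixed_horizon_rate = false_positive_rates()
print(f"peeking: {peeking_rate:.3f}  fixed horizon: {fixed_horizon_rate:.3f}")
```

With five looks the peeking rate lands at roughly 14% in expectation, against about 5% for the single pre-declared look. Pocock and O'Brien-Fleming boundaries restore the nominal rate by raising the critical value at each scheduled look.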
Per Kohavi et al.: users react differently to novel UX. Novelty inflates the early-period effect; primacy depresses it. Mitigation:
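One mitigation consistent with the week 2+ gate in the post-experiment checks is to compare the effect in the first week against the effect in later weeks. The sketch below assumes a hypothetical list of per-day relative lifts on the OEC and an arbitrary 50% retention tolerance; both are illustrative choices, not values from the source.

```python
from statistics import mean


def novelty_assessment(daily_lift, first_period_days=7, min_retained=0.5):
    """Compare the mean lift in the first period against the remaining days.

    Returns (early_mean, late_mean, persists). `persists` is True when the
    later effect has the same sign as the early one and keeps at least
    `min_retained` of its magnitude (an assumed tolerance).
    """
    if len(daily_lift) <= first_period_days:
        raise ValueError("need data beyond the first period")
    early = mean(daily_lift[:first_period_days])
    late = mean(daily_lift[first_period_days:])
    same_sign = early * late > 0
    persists = same_sign and abs(late) >= min_retained * abs(early)
    return early, late, persists


# Hypothetical daily lifts: strong in week 1, fading in week 2 (novelty-like).
fading = [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.03,
          0.01, 0.01, 0.00, 0.01, 0.00, 0.01, 0.00]
early, late, persists = novelty_assessment(fading)
print(f"week 1: {early:.3f}  week 2+: {late:.3f}  persists: {persists}")
```

A primacy pattern shows up the other way round: the later mean exceeds the early one, so the check passes and the week 1 number understates the long-run effect.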
Before ship:
| Gate | Pass criterion |
|---|---|
| Pre-registration honoured | OEC / guardrails / unit / schedule unchanged since launch |
| SRM clean | p > 0.0001 on the chi-square (Step 2) |
| OEC significant under the declared method | Sequential / always-valid / fixed-horizon p-value |
| All guardrails within thresholds | Per guardrail-metrics-reference |
| Multiple-comparison corrected | Bonferroni / BH if many metrics |
| Novelty assessment | Effect persisting in week 2+ |
| Segment-stability | Effect direction consistent across major segments (no Simpson's paradox) |
| Trust metric stable | Opt-out / complaint rate not up |
Document each pass in experiments/<id>/result.md with the specific numbers.
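For the multiple-comparison gate, the two named corrections behave differently: Bonferroni controls the family-wise error rate and is stricter, while Benjamini-Hochberg (BH) controls the false discovery rate. A minimal sketch of both follows; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """BH step-up procedure; returns rejections in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for five metrics.
p_values = [0.004, 0.012, 0.039, 0.041, 0.2]
print("Bonferroni:", bonferroni(p_values))
print("BH:        ", benjamini_hochberg(p_values))
```

On these inputs Bonferroni keeps only the first metric (threshold 0.05 / 5 = 0.01), while BH keeps the first two. Whichever method is used should be the one declared in the pre-registration.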
The output of this skill: a markdown checklist + sign-off form.
# Experiment <id> — Validity Checklist
## Pre-registration (signed by: <owner>, date: <YYYY-MM-DD>)
- [ ] OEC declared: <metric>
- [ ] Power calc: N=<X>, alpha=0.05, beta=0.20, MDE=<Y>%
- [ ] Guardrails declared: <list with thresholds>
- [ ] Randomization unit: <user_id / device_id>
- [ ] Allocation: <50/50>
- [ ] Look schedule: <Pocock 5 looks at days 2,4,7,10,14>
- [ ] Stop-early rules: <on OEC reaching alpha-threshold>
## During experiment
- [ ] SRM check: chi-square p > 0.0001 ([result: <p>])
- [ ] Assignment integrity tests passing
- [ ] Telemetry validated
## Post-experiment (signed by: <reviewer>, date: <YYYY-MM-DD>)
- [ ] Pre-registration honoured (no scope changes)
- [ ] SRM final check: p > 0.0001 ([result])
- [ ] OEC significant (p=<X>; method: <Pocock>)
- [ ] All guardrails within thresholds:
  - api_p95_latency: +<X>% / +<Y>ms - <status>
  - dau: <X>% - <status>
- [ ] Multiple-comparison adjusted (method: <Bonferroni / BH>)
- [ ] Novelty assessment: effect persists week 2+? <yes / no>
- [ ] Segment stability: direction consistent? <yes / no>
- [ ] Trust metric stable? <yes / no>
## Ship decision: <ship / no-ship / extend>
Reasoning: <one paragraph>
Sign-off: <name>, <date>
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Post-hoc OEC change | "We found a better metric" = p-hacking | Pre-register |
| Skip SRM check | Invalidates results without detection | Always run chi-square pre-ship |
| Decision before checklist completion | Ship-then-validate is rejected by the trusted-experiments framework | Block ship on incomplete checklist |
| Reviewer = experiment owner | Self-sign-off; no second pair of eyes | Reviewer must be someone other than the owner |
| Skip novelty assessment | Effect disappears post-ship | Look at week-2+ subset |
| Skip segment stability | Simpson's paradox: total positive, per-segment negative | Audit by major segments |
| Treat the checklist as paperwork | Items checked without verification | Each item produces evidence (number, link, calc) |
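The segment-stability row can be illustrated with a small numeric case of Simpson's paradox, where the pooled effect is positive but every segment is negative. The counts below are made up for illustration (patterned on the well-known textbook example), and the function name and segment labels are hypothetical.

```python
def segment_stability(segments):
    """Check whether each segment's effect direction matches the pooled one.

    segments maps a name to ((control_conv, control_n), (treat_conv, treat_n)).
    Returns (pooled_diff, per_segment_diffs, consistent).
    """
    per_segment = {}
    c_conv = c_n = t_conv = t_n = 0
    for name, ((cc, cn), (tc, tn)) in segments.items():
        per_segment[name] = tc / tn - cc / cn
        c_conv, c_n = c_conv + cc, c_n + cn
        t_conv, t_n = t_conv + tc, t_n + tn
    pooled = t_conv / t_n - c_conv / c_n
    consistent = all(diff * pooled > 0 for diff in per_segment.values())
    return pooled, per_segment, consistent


# Illustrative counts only: treatment wins overall but loses in both segments.
segments = {
    "segment_1": ((81, 87), (234, 270)),   # control 93.1% vs treatment 86.7%
    "segment_2": ((192, 263), (55, 80)),   # control 73.0% vs treatment 68.8%
}
pooled, per_segment, consistent = segment_stability(segments)
print(f"pooled diff: {pooled:+.3f}")
for name, diff in per_segment.items():
    print(f"{name}: {diff:+.3f}")
print("direction consistent:", consistent)
```

The reversal comes from the arms having very different segment mixes, which is also why a clean SRM check and a per-segment audit tend to go together.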
guardrail-metrics-reference, peeking-problem-reference.
statsig-test, optimizely-test, vwo-test, amplitude-experiment-test, sample-ratio-mismatch-detector.
npx claudepluginhub testland/qa --plugin qa-experimentation