From cks
Use when the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions 'A/B test,' 'split test,' 'experiment,' 'conversion test,' or 'growth experiment.'
Slash command: /cks:ab-test-setup (model: sonnet)
Expert knowledge for designing statistically valid experiments that produce trustworthy, actionable results.
1. Hypothesis → 2. Metric Selection → 3. Sample Size → 4. Test Design → 5. Run → 6. Interpret → 7. Act
Every step must be completed before the next. Skipping hypothesis formation produces uninterpretable results.
Template: "Because [evidence/observation], we believe [change] will [metric direction] for [audience segment], measured by [primary metric]."
Example: "Because our heatmaps show 60% of users ignore our hero CTA, we believe moving it above the fold will increase free trial signups for new visitors, measured by signup rate."
Good hypotheses have:
Bad hypothesis signals: "Let's test a new headline," "Maybe color X converts better," "Users might prefer Y."
Primary metric: One metric that defines success. The test either wins or loses on this.
Secondary metrics: Supporting signals that explain why the primary moved.
Guardrail metrics: Metrics that must not regress. If they do, the test fails regardless of primary metric lift.
| Metric Type | Example | Purpose |
|---|---|---|
| Primary | Trial signup rate | Win/loss decision |
| Secondary | CTA click rate, page scroll depth | Diagnose cause |
| Guardrail | Bounce rate, page load time | Prevent regressions |
Avoid: Multiple primary metrics. If both need to improve, run two tests.
Required inputs:
Rule of thumb formula:
n = (16 × σ²) / δ²
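Where n = required sample size per variant (for roughly 80% power at a two-sided α = 0.05), σ = standard deviation of the metric, δ = minimum detectable difference in absolute terms. For a conversion rate p, σ² = p(1 − p).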
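As a rough illustration, the rule of thumb above can be applied to a conversion-rate test. This sketch assumes a binary metric (so σ² = p(1 − p) for baseline rate p), treats n as the sample size per variant, and uses a hypothetical function name and example inputs; a dedicated calculator will give slightly different numbers.

```python
import math


def sample_size_per_variant(baseline_rate: float, min_detectable_diff: float) -> int:
    """Rule-of-thumb n = 16 * sigma^2 / delta^2, per variant.

    Assumes a binary conversion metric, so sigma^2 = p * (1 - p),
    and an absolute (not relative) minimum detectable difference.
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if min_detectable_diff <= 0:
        raise ValueError("min_detectable_diff must be positive")
    variance = baseline_rate * (1 - baseline_rate)
    return math.ceil(16 * variance / min_detectable_diff ** 2)


# Hypothetical inputs: 5% baseline, detect an absolute lift of 1 point (5% -> 6%).
n = sample_size_per_variant(0.05, 0.01)
print(f"Visitors needed per variant: {n}")
```

Halving the minimum detectable difference roughly quadruples the required sample, which is why small expected lifts need long tests.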
Online calculators: Evan Miller's A/B test calculator, Optimizely's calculator.
Typical sample sizes for conversion rate tests:
Critical: Calculate sample size BEFORE running. Post-hoc power analysis is invalid.
Frequentist (Classical):
Bayesian:
Recommendation: Use Bayesian when traffic is limited or you need to stop early. Use frequentist when stakes are high and you can commit to the full run.
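One way to make the Bayesian option concrete is to estimate the probability that the variant's true conversion rate exceeds the control's. The sketch below is a minimal illustration, assuming a uniform Beta(1, 1) prior on each rate and Monte Carlo sampling from the posteriors; the function name, the prior, and the example counts are assumptions, not something the source prescribes.

```python
import random


def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Estimate P(rate_b > rate_a) from Beta posteriors.

    Assumes a Beta(1, 1) (uniform) prior on each conversion rate,
    so each posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts: control 100/2000 conversions, variant 130/2000.
p_better = prob_variant_beats_control(100, 2000, 130, 2000)
print(f"P(variant beats control) ~ {p_better:.3f}")
```

The decision threshold for this probability (for example 95%) should still be fixed before the test starts.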
A/B (or A/B/n): Test one variable across 2+ variants.
Multivariate (MVT): Test multiple variables simultaneously in combinations.
Multi-Armed Bandit: Adaptive allocation — more traffic to better-performing variants over time.
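Adaptive allocation can be sketched with Thompson sampling, which is one common bandit strategy; the source does not say which algorithm it has in mind. The simulation below assumes Beta(1, 1) priors and made-up true conversion rates, and all names are hypothetical.

```python
import random


def thompson_pick(stats, rng):
    """Pick the variant to show next.

    stats maps variant name -> (conversions, visitors). Each variant gets
    one draw from its Beta posterior; the highest draw wins the visitor.
    """
    best_name, best_draw = None, -1.0
    for name, (conversions, visitors) in stats.items():
        draw = rng.betavariate(1 + conversions, 1 + visitors - conversions)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


def simulate(true_rates, visitors=5000, seed=0):
    """Simulate traffic allocation against assumed true conversion rates."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_rates}
    for _ in range(visitors):
        name = thompson_pick(stats, rng)
        conversions, shown = stats[name]
        converted = rng.random() < true_rates[name]
        stats[name] = (conversions + int(converted), shown + 1)
    return stats


# Hypothetical true rates; in a real test these are unknown.
result = simulate({"control": 0.05, "variant": 0.10})
for name, (conversions, shown) in result.items():
    print(f"{name}: shown {shown} times, {conversions} conversions")
```

Because traffic shifts toward the apparent leader, the weaker variants end up with little data, so a bandit trades clean effect estimates for less traffic spent on losing variants.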
Segment-level analysis is fine AFTER a test completes. But do not start a test with the goal of "finding the best variant for segment X" unless you've pre-specified that segment and sized accordingly (Bonferroni correction applies).
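The Bonferroni correction mentioned above divides the significance threshold by the number of comparisons. A minimal sketch, with hypothetical segment names and p-values:

```python
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def significant_segments(p_values: dict, alpha: float = 0.05) -> list:
    """Return the segments whose p-value clears the corrected threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [segment for segment, p in p_values.items() if p < threshold]


# Hypothetical p-values for three pre-specified segments.
segment_p_values = {"new visitors": 0.004, "returning": 0.03, "mobile": 0.20}
winners = significant_segments(segment_p_values)
print(winners)
```

Here "returning" would pass an uncorrected 0.05 threshold but fails the corrected one (0.05 / 3), which is the kind of segment-level false positive the correction guards against.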
Never stop early based on results. Run for:
Business cycle effects to account for:
A result can be statistically significant (p < 0.05) but practically meaningless (0.1% lift). Always check:
New variants often see an initial uplift from curiosity, then regress. If your test duration is short, results may not hold post-launch. Prefer tests that run for 4+ weeks.
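Checking results during a test and stopping when you see p < 0.05 inflates the false positive rate dramatically. With α = 0.05, checking at every 10% of collected data (ten looks) gives an actual false positive rate of roughly 19%, nearly four times the nominal 5%, and the rate keeps climbing with more frequent checks.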
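The inflation from peeking can be checked with a small A/A simulation, where both arms are identical and any "significant" result is a false positive. This sketch assumes a metric with known unit variance and a fixed z threshold of 1.96 at every look; the function name and parameters are hypothetical.

```python
import math
import random


def false_positive_rate(looks, batch=100, simulations=4000, z_crit=1.96, seed=0):
    """Share of A/A tests declared significant at any of `looks` interim checks.

    Each look adds `batch` users per arm. The metric is assumed to be
    N(0, 1) in both arms, so the sum over a batch is N(0, batch).
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(simulations):
        sum_a = sum_b = 0.0
        n = 0  # users per arm so far
        for _ in range(looks):
            sum_a += rng.gauss(0.0, math.sqrt(batch))
            sum_b += rng.gauss(0.0, math.sqrt(batch))
            n += batch
            # Difference of sums has variance 2n under the null.
            z = (sum_b - sum_a) / math.sqrt(2 * n)
            if abs(z) > z_crit:
                false_positives += 1
                break  # the experimenter stops at the first "win"
    return false_positives / simulations


single_look = false_positive_rate(looks=1)
ten_looks = false_positive_rate(looks=10)
print(f"1 look: {single_look:.3f}, 10 looks: {ten_looks:.3f}")
```

A single look stays close to the nominal 5%, while stopping at the first significant peek across ten looks lands well above it.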
Solutions:
| Scenario | What It Means | Action |
|---|---|---|
| Winner, practical significance | Test worked | Ship variant, document learnings |
| Winner, no practical significance | Stat artifact or tiny effect | Do not ship unless free; learn from it |
| No winner | No effect detected | Bigger change needed, or hypothesis wrong |
| Variant loses | Baseline is better | Keep control, investigate why |
| Guardrail violated | Variant causes harm | Stop test, keep control |
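The decision table above can be sketched as a function. This is one possible encoding, assuming a pooled two-proportion z-test for the primary metric, an absolute minimum practical lift chosen in advance, and a guardrail check computed elsewhere; the function names, parameters, and example counts are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided standard normal tail via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


def decide(conv_a, n_a, conv_b, n_b, min_practical_lift,
           guardrail_violated, alpha=0.05):
    """Map a finished test onto the actions in the decision table."""
    if guardrail_violated:
        return "stop test, keep control"
    p_value = two_proportion_p_value(conv_a, n_a, conv_b, n_b)
    lift = conv_b / n_b - conv_a / n_a  # absolute lift of variant over control
    if p_value >= alpha:
        return "no winner: bigger change needed, or hypothesis wrong"
    if lift < 0:
        return "keep control, investigate why"
    if lift < min_practical_lift:
        return "do not ship unless free; learn from it"
    return "ship variant, document learnings"


# Hypothetical result: control 500/10000, variant 600/10000,
# with a pre-agreed minimum practical lift of 0.5 percentage points.
action = decide(500, 10_000, 600, 10_000,
                min_practical_lift=0.005, guardrail_violated=False)
print(action)
```

The guardrail check comes first on purpose: a violated guardrail ends the test regardless of what the primary metric shows.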
Score each test idea on 3 dimensions (1-10):
PIE score = (P + I + E) / 3, where P = Potential, I = Importance, and E = Ease. Run highest scores first.
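A short sketch of PIE prioritization, assuming P, I, and E stand for Potential, Importance, and Ease as in the common PIE framework. The test ideas and their scores are made up for illustration.

```python
def pie_score(potential: int, importance: int, ease: int) -> float:
    """Average of the three 1-10 ratings."""
    for rating in (potential, importance, ease):
        if not 1 <= rating <= 10:
            raise ValueError("each rating must be between 1 and 10")
    return (potential + importance + ease) / 3


# Hypothetical backlog: (idea, potential, importance, ease)
ideas = [
    ("Rewrite pricing page", 8, 9, 4),
    ("Move hero CTA above the fold", 6, 5, 9),
    ("Change footer link color", 3, 4, 10),
]

# Highest PIE score first.
ranked = sorted(ideas, key=lambda idea: pie_score(*idea[1:]), reverse=True)
for name, p, i, e in ranked:
    print(f"{pie_score(p, i, e):.1f}  {name}")
```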
For every test, record:
Each winning variant becomes the new control, and subsequent tests build on previous winners. This is the compound growth engine of CRO (conversion rate optimization).
| Rationalization | Reality |
|---|---|
| "We don't have enough traffic — let's just run it for a week" | Underpowered tests produce noise, not signal. Calculate the required duration first. |
| "The results look good at day 3, let's stop early" | Stopping as soon as a peek shows p < 0.05 pushes the false positive rate well above 5% (roughly 19% with ten looks). Run to completion. |
| "We're testing 5 things at once on this page" | Multivariate tests need 5–10× more traffic. You almost certainly don't have it. |
| "Statistical significance is all we need" | Statistical significance tells you the effect is unlikely to be chance. Practical significance tells you whether it matters. Check both. |
| "Let's test everything and see what works" | No hypothesis = no learning. Even a negative result teaches nothing without a hypothesis. |
| "Our audience is different, so 80% confidence is fine" | 80% confidence means α = 0.20, four times the false positive rate of the 95% standard. There are no free lunches. |
npx claudepluginhub cardinalconseils/claude-starter --plugin cks