From godmode
Guides A/B testing workflows: discovers experiment platforms via SDK detection, designs hypotheses and metrics, calculates sample sizes with power analysis, implements hashing assignments, and applies frequentist/Bayesian stats.
How this skill is triggered — by the user, by Claude, or both
Slash command
/godmode:experimentThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
- User invokes `/godmode:experiment`
/godmode:experiment# Detect experimentation SDK
grep -l "statsig\|optimizely\|growthbook\|launchdarkly" \
package.json pyproject.toml 2>/dev/null
# Check for existing experiment configs
find . -name "*experiment*" -o -name "*ab_test*" \
| grep -v node_modules | head -10
EXPERIMENT DISCOVERY:
Baseline metric: <name> = <current value>
Traffic: <daily users hitting the surface>
Platform: none | Statsig | Optimizely | GrowthBook
Risk: low (revenue) | medium (growth) | high (minor UX)
IF no platform: recommend Statsig (best free tier)
IF baseline unknown: measure for 7 days first
IF traffic < 1000/day: experiment may take months
EXPERIMENT DESIGN:
Name: <experiment-slug>
Hypothesis: If we <change>, then <metric> will
<improve> by <magnitude> because <reasoning>.
METRICS:
| Type | Metric | Baseline | Target |
|-----------|-------------|----------|--------|
| Primary | <OEC> | <val> | +<X>% |
| Guardrail | latency p95 | <val> | < +10% |
| Guardrail | error rate | <val> | < +0.1%|
| Guardrail | revenue | <val> | > -2% |
| Secondary | <metric> | <val> | +<X>% |
POWER ANALYSIS:
Alpha: 0.05 (5% false positive rate)
Power: 0.80 (80% detection probability)
Baseline rate: <current conversion — e.g., 3.2%>
MDE: <smallest meaningful change — e.g., 10% rel>
Variants: <2 for A/B, N for multivariate>
THRESHOLDS:
Minimum power: 0.80
Maximum alpha: 0.05
Minimum MDE: what matters to the business
Minimum duration: 7 days (avoid day-of-week bias)
IF sample size > 30 days of traffic: increase MDE
or increase traffic allocation
DETERMINISTIC HASHING (recommended):
hash(user_id + experiment_id) % 10000
0-4999 = Control, 5000-9999 = Treatment
RULES:
Same user always gets same variant
No storage required — computed from hash
Works across client and server
Increasing % adds users, never flips existing ones
DECISION RULES (frequentist):
SHIP: p < 0.05 AND effect >= MDE
AND no guardrail regressions
KILL: p < 0.05 AND effect negative
WAIT: p >= 0.05 AND sample not reached
INCONCLUSIVE: sample reached AND p >= 0.05
WHEN to use Bayesian:
Need probability of being better (not just p-value)
Want continuous monitoring without peeking penalty
Business prefers "95% chance B is better" language
EXPERIMENT RESULTS:
Duration: <start> — <end>
Participants: <N> (Control: <N>, Treatment: <N>)
SRM CHECK:
Expected: 50/50, Actual: <actual>/<actual>
Chi-squared p: <value>
IF p < 0.01: SRM detected — STOP, do not interpret
PRIMARY METRIC:
| Variant | Value | vs Control | p-value | Sig? |
|-----------|-------|-----------|---------|------|
| Control | <val> | — | — | — |
| Treatment | <val> | +<X>% | <p> | Y/N |
GUARDRAILS:
latency p95: <PASS|FAIL> (< +10% threshold)
error rate: <PASS|FAIL> (< +0.1% threshold)
revenue: <PASS|FAIL> (> -2% threshold)
LIFECYCLE:
IDEA → DESIGN → REVIEW → QUEUED → RAMPING →
LIVE → ANALYZING → DECIDED → CLEANUP
RAMP SCHEDULE:
5% → 25% → 50% → 100% (hold 7+ days each)
CLEANUP (within 2 weeks of decision):
Remove losing variant code
Delete feature flag
Archive experiment config
PRE-LAUNCH CHECKLIST:
| Check | Status |
|-----------------------------------|--------|
| Hypothesis specific & falsifiable | ? |
| Primary metric (OEC) defined | ? |
| Guardrails defined | ? |
| Sample size achievable | ? |
| Power >= 0.80 | ? |
| Assignment deterministic & sticky | ? |
| Exposure logged once per user | ? |
| Control unchanged | ? |
| Mutual exclusion for conflicts | ? |
Commit: "experiment: <name> — <N> variants, <metric>, <statistical method>"
Never ask to continue. Loop autonomously until done.
1. SDK: statsig, optimizely, growthbook, launchdarkly
2. Analytics: Amplitude, Mixpanel, Segment
3. Existing: experiment configs, assignment logic
Print: Experiment: {name} — {status}. Split: {control}%/{treatment}%. p-value: {p}. Lift: {lift}%. Verdict: {verdict}.
timestamp experiment metric p_value lift status
KEEP if: validated, implemented, metrics confirmed
DISCARD if: failed validation OR tests broke
OR guardrails tripped
STOP when ALL of:
- All 9 pre-launch checks pass
- Exposure logging fires once per user
- Sample size documented with MDE, alpha, power
- Guardrails defined and monitored
- Cleanup plan scheduled
npx claudepluginhub arbazkhan971/godmodeDesigns statistically sound A/B experiments with pre-registration, power analysis, and guardrail metrics for trustworthy causal evidence.
Guides A/B test setup with mandatory gates for hypothesis validation, metrics definition, sample size estimation, and execution readiness checks.
Guides A/B test setup with hard gates for hypothesis validation, metrics definition, sample size calculation, assumptions checks, and execution readiness. Use before coding experiments.