Skill

review-experiment

Review experiment implementation code for common A/B testing pitfalls — assignment bugs, logging gaps, bucketing leaks, and feature flag issues.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/ab-testing:review-experiment

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Review experiment implementation code for A/B testing correctness. Accept a file path, code block, or directory from `$ARGUMENTS` or the conversation context.

SKILL.md

91 lines · ~1k tokens

Stats

LanguagePython

Parent stars0

MaintenanceFair

Last CommitMar 19, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Review Categories

1. Assignment Correctness

Is the randomization deterministic and consistent? (Same user always gets same variant)
Is a proper hash function used? (MD5/SHA on user ID + experiment salt, not Math.random())
Is the randomization unit correct? (User-level for most experiments, not request/session-level)
Is the experiment salt unique? (Prevents correlation between experiments)
Is the traffic allocation implemented correctly? (Hash modulo or range check)

2. Bucketing Stability

Can a user switch variants mid-experiment?
- Cookie deletion or expiration
- Logged-in vs. logged-out state
- App updates changing assignment logic
- Server-side vs. client-side assignment inconsistency
Is the assignment cached appropriately for the session/request lifecycle?

3. Exposure Logging

Is the exposure event logged at the point of divergence — the exact moment the user's experience differs?
Is exposure logged BEFORE the feature is shown (not after interaction)?
Is exposure logged for BOTH control and treatment?
Are there code paths where a user is assigned but never exposed (intent-to-treat issues)?
Is the exposure event deduplicated appropriately?

4. Metric Instrumentation

Are conversion events correctly attributed to the variant the user was assigned to?
Is there risk of double-counting conversions?
Are timestamps consistent between exposure and conversion events?
For server-side metrics: is the variant correctly propagated through the request context?

5. Feature Flag Hygiene

Is the feature flag checked consistently across ALL code paths?
Are default/fallback values correct? (What happens if the experiment service is down?)
Are there race conditions between flag evaluation and rendering?
Is the flag evaluated once and reused, or re-evaluated multiple times with potential inconsistency?

6. Performance

Does the experiment add latency to the critical path?
Is the feature flag evaluation on the hot path? Could it be cached?
Are experiment assignments batched or fetched one-at-a-time?
Does the treatment variant load additional resources (JS, images) that could affect performance metrics?

7. Cleanup and Interactions

Is there dead code from previous experiments that should be removed?
Are multiple experiments nesting or interacting unintentionally?
Is the experiment properly scoped to avoid affecting unrelated features?
Is there a clear plan for cleanup after the experiment concludes?

Output Format

Structure your review as:

Summary: One paragraph overall assessment.

Findings:

For each issue found:

[SEVERITY] Title File: path/to/file.py:42

Description of the issue and why it matters.

Fix:
suggested code change

Severity levels:

CRITICAL: Will produce invalid experiment results. Must fix before launch.
WARNING: Risk of data quality issues. Should investigate.
INFO: Best practice suggestion. Nice to fix but not blocking.

Verdict: Overall assessment — safe to launch, needs fixes, or needs redesign.

Guidelines

Focus on issues specific to experimentation. Don't review general code quality unless it directly impacts experiment correctness.
If you can't find the experiment assignment logic, flag that as a critical issue.
If the code uses an experimentation framework/SDK, check that it's being used correctly per its conventions.
When in doubt, flag it. Better to over-report than miss a bucketing bug.

review-experiment

Invocation

Context Preview

SKILL.md

review-experiment

Invocation

Context Preview

SKILL.md

Review Categories

1. Assignment Correctness

2. Bucketing Stability

3. Exposure Logging

4. Metric Instrumentation

5. Feature Flag Hygiene

6. Performance

7. Cleanup and Interactions

Output Format

Guidelines

Similar Skills

Review Categories

1. Assignment Correctness

2. Bucketing Stability

3. Exposure Logging

4. Metric Instrumentation

5. Feature Flag Hygiene

6. Performance

7. Cleanup and Interactions

Output Format

Guidelines

Similar Skills