From epistemic-skills
Evaluates whether experimental results justify inferential claims using statistical tests, effect sizes, and assumption checks before allowing claims to leave smokes/.
Slash command: `/epistemic-skills:statistical-rigor`
> **Related skills:** `/skill:experiment-execution`, `/skill:baseline-reproduction`, `/skill:falsification-review`, `/skill:verification-before-publication`
Execution produced numbers. This phase decides whether those numbers justify an inferential
sentence. A promising mean is not enough. A dramatic chart is not enough. A small p-value with no
assumption checks, no effect size, and no multiplicity accounting is not enough.
This skill sits after /skill:experiment-execution and before /skill:falsification-review.
Execution measures. Statistical rigor decides whether interpretation is allowed. Falsification
review then attacks the surviving claim. If you skip this phase, you hand the next reviewer a weak
sentence and call the later pain a surprise.
No result leaves smokes/ without statistical justification
If the result has not survived this gate, it stays in smokes/. It does not move into
experiments/{id}/RESULTS.md. It does not move into root RESULTS.md. It does not become a
sentence containing significant, difference, improves, beats, or correlates.
- `experiments/{id}/smokes/` contains completed runs and you are about to interpret them
- `significant`, `difference`, `beats`, `improves`, `correlates`, or `relationship`
- `n` has been collected and summary statistics now exist

`smokes/`, `/skill:falsification-review`, `/skill:research-question`, `/skill:preregistration`, `/skill:experiment-execution`, `/skill:baseline-reproduction`, `/skill:verification-before-publication`

Use the real repo state from `src/state/repo.ts`. Do not reconstruct the experiment from memory. Do
not invent side bookkeeping.
| Need | API or file | Rule |
|---|---|---|
| Quick repo snapshot | loadRepoState(cwd) | Start from current repo state, not memory |
| Resolve the live hypothesis | loadHypotheses(cwd) + getActiveHypothesis(entries) | Start from an exact id, claim, n, judgeRef, and baselineRef |
| Recover canonical entries | parseHypotheses(content) | Use when HYPOTHESES.md was edited manually and you need the parser result |
| Check required artifacts | fileExists(path) | Missing prereg or smoke artifacts block interpretation |
| Verify judge continuity | getJudgeLock(cwd, id) + computeJudgeHash(judgeRef, id) | Judge drift can invalidate comparisons before statistics even begin |
| Read baseline context | loadBaselines(cwd) + getBaselineAgeDays(entry) | Find the named comparator and check freshness |
| Read spend | getHypothesisSpend(cwd, id), getHypothesisSpendByCategory(cwd, id), getAllHypothesisSpends(cwd) | Distinguish underpowered evidence from finished evidence |
| Normalize registry text | hypothesisToMarkdown(h) + saveHypotheses(cwd, entries) | Only for honest cleanup, never to rewrite the claim around observed data |
| Guard status changes | updateHypothesisStatus(cwd, id, status) | Do not call CONFIRMED here; statistical significance is not confirmation |
Read these artifacts before interpreting a result:
- `HYPOTHESES.md`
- `BASELINES.md`
- `experiments/{id}/prereg.md`
- `experiments/{id}/judge.lock`
- `experiments/{id}/smokes/aggregate.md`
- `experiments/{id}/smokes/`
- `experiments/{id}/RESULTS.md`

IDENTIFY -> CHECK ASSUMPTIONS -> SELECT TEST -> COMPUTE EFFECT SIZE
-> CORRECT MULTIPLICITY -> REPORT EXACTLY -> ONLY THEN INTERPRET
Skip one step and you do not have statistical justification. Run both a parametric and a non-parametric test, keep the prettier answer, and you do not have statistical justification. Report only a p-value and you do not have statistical justification. Count only the comparisons that survived and you do not have statistical justification.
Most fake rigor starts before the test. It starts when prompts, retries, judges, or slices are counted as independent observations when the real unit is user, task, seed family, or document.
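A minimal sketch of rebuilding the analysis at the real unit, assuming each judged output is a row that carries the id of the unit it belongs to. The `Row` shape, the `unitId` field, and `collapseToUnits` are hypothetical names for illustration, not part of `src/state/repo.ts`, and averaging within a unit is only one possible way to collapse.

```typescript
// Hypothetical row shape: one judged output per row.
interface Row {
  unitId: string; // the real unit of analysis, e.g. a task or user id (assumed field name)
  score: number;
}

// Collapse repeated rows to one mean score per unit, so n counts units, not rows.
function collapseToUnits(rows: Row[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const row of rows) {
    const acc = sums.get(row.unitId) ?? { total: 0, count: 0 };
    acc.total += row.score;
    acc.count += 1;
    sums.set(row.unitId, acc);
  }
  const means = new Map<string, number>();
  sums.forEach((acc, unitId) => {
    means.set(unitId, acc.total / acc.count);
  });
  return means;
}

// Made-up rows: three prompts, but only two tasks.
const rows: Row[] = [
  { unitId: "task-1", score: 1 },
  { unitId: "task-1", score: 0 },
  { unitId: "task-2", score: 1 },
];
const perUnit = collapseToUnits(rows);
console.log(`rows=${rows.length}, units=${perUnit.size}`);
```

After collapsing, the sample size that enters any test is the number of units, not the number of rows.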
1. Call `loadHypotheses(cwd)`.
2. Use `getActiveHypothesis(entries)` or select the exact id explicitly.
3. Read `experiments/{id}/prereg.md`.
4. Confirm what `n` meant in the preregistration.
5. Confirm that `baselineRef` is the one now under analysis.
6. Verify that `getJudgeLock(cwd, id)` still matches `computeJudgeHash(judgeRef, id)` before trusting the numbers.

Before interpreting results, verify normality, homogeneity of variance, independence of observations, and the absence of ceiling or floor distortion. Parametric tests are not defaults. They are contracts with assumptions.
Use this rule exactly:
- n < 50, check normality with Shapiro-Wilk
- n ≥ 50, inspect Q-Q plots

Apply the check to the quantity that actually enters the test:

- n < 50, record the Shapiro-Wilk statistic and exact p-value.
- n ≥ 50, inspect the Q-Q plot for systematic bends, heavy tails, or strong outliers.

If you are comparing independent groups with mean-based tests, variance matters. Different spreads can make pooled summaries lie. Use Levene's test. What to do:
Independence is usually a design fact, not a software checkbox. No p-value rescues non-independent data treated as independent. Check directly from the protocol:

If the metric piles up at the maximum or minimum possible value, ordinary mean comparisons can mislead. The absence of movement may be a property of the scale, not of the underlying system. Check:
| Condition | Consequence |
|---|---|
| Normality acceptable, variance acceptable, independence acceptable, no serious ceiling/floor distortion | Parametric path remains eligible |
| Normality doubtful | Use the non-parametric path or justify a robust alternative |
| Variances unequal in an independent-groups design | Use the variance-robust branch or leave the pooled test |
| Independence broken | Block simple inferential claims; redesign or stay descriptive |
| Severe ceiling/floor effects | Interpret cautiously; the metric may not support the claim |

No assumption memo, no interpretation.
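
The table can be read as a small lookup over an assumption memo. The sketch below is illustrative only: `AssumptionMemo`, its field names, and `consequencesFor` are hypothetical and do not describe an existing repo schema. It returns every consequence that applies rather than picking one, since the table does not rank the rows.

```typescript
// Hypothetical memo shape; the field names are assumptions made for this sketch.
interface AssumptionMemo {
  normalityAcceptable: boolean;
  varianceAcceptable: boolean;
  independenceAcceptable: boolean;
  severeCeilingOrFloor: boolean;
  independentGroupsDesign: boolean; // the variance row only applies to this design
}

// Map a memo to the consequences listed in the table above.
function consequencesFor(memo: AssumptionMemo | undefined): string[] {
  if (!memo) return ["No assumption memo, no interpretation."];
  const out: string[] = [];
  if (!memo.independenceAcceptable) {
    out.push("Block simple inferential claims; redesign or stay descriptive");
  }
  if (!memo.normalityAcceptable) {
    out.push("Use the non-parametric path or justify a robust alternative");
  }
  if (memo.independentGroupsDesign && !memo.varianceAcceptable) {
    out.push("Use the variance-robust branch or leave the pooled test");
  }
  if (memo.severeCeilingOrFloor) {
    out.push("Interpret cautiously; the metric may not support the claim");
  }
  if (out.length === 0) out.push("Parametric path remains eligible");
  return out;
}

const clean: AssumptionMemo = {
  normalityAcceptable: true,
  varianceAcceptable: true,
  independenceAcceptable: true,
  severeCeilingOrFloor: false,
  independentGroupsDesign: true,
};
console.log(consequencesFor(clean));
```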
Pick the test from the data-generating structure, not from field habit. The first question is not "Which test do I know?" The first question is "What design did we actually run?"
| Design | Assumption state | Use this test | Effect size to report |
|---|---|---|---|
| Two independent groups | Normal | t-test | Cohen's d |
| Two independent groups | Non-normal | Mann-Whitney U | rank-based magnitude or a clearly justified alternative |
| Paired data | Normal | paired t-test | Cohen's d for paired differences |
| Paired data | Non-normal | Wilcoxon signed-rank | rank-based magnitude or a clearly justified alternative |
| 3+ groups | Normal | ANOVA + post-hoc | eta-squared |
| 3+ groups | Non-normal | Kruskal-Wallis | rank-based magnitude or a clearly justified alternative |
| Continuous relationship | approximately linear / normal-friendly | Pearson correlation | R^2 or an equivalent variance-explained statement |
| Continuous relationship | monotonic / non-normal | Spearman correlation | report the coefficient plus a disciplined magnitude interpretation |

If the design is outside this table, stop and choose the right model deliberately. Do not coerce
count data, bounded proportions, or clustered measurements into a t-test because the table feels
close enough.
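
As a sketch, the selection table can be encoded as a lookup keyed by design and assumption state, so the choice is made from the design rather than from habit. The `Design` labels, `TestChoice`, and `selectTest` are hypothetical names introduced here; the test and effect-size strings are copied from the table.

```typescript
type Design =
  | "two-independent-groups"
  | "paired"
  | "three-plus-groups"
  | "continuous-relationship";

interface TestChoice {
  test: string;
  effectSize: string;
}

const RANK_BASED = "rank-based magnitude or a clearly justified alternative";

// One entry per row of the table: [parametric-eligible choice, fallback choice].
const TABLE: Record<Design, [TestChoice, TestChoice]> = {
  "two-independent-groups": [
    { test: "t-test", effectSize: "Cohen's d" },
    { test: "Mann-Whitney U", effectSize: RANK_BASED },
  ],
  paired: [
    { test: "paired t-test", effectSize: "Cohen's d for paired differences" },
    { test: "Wilcoxon signed-rank", effectSize: RANK_BASED },
  ],
  "three-plus-groups": [
    { test: "ANOVA + post-hoc", effectSize: "eta-squared" },
    { test: "Kruskal-Wallis", effectSize: RANK_BASED },
  ],
  "continuous-relationship": [
    { test: "Pearson correlation", effectSize: "R^2 or an equivalent variance-explained statement" },
    { test: "Spearman correlation", effectSize: "coefficient plus a disciplined magnitude interpretation" },
  ],
};

// parametricEligible is the outcome of the assumption checks, not a preference.
function selectTest(design: Design, parametricEligible: boolean): TestChoice {
  const [parametric, fallback] = TABLE[design];
  return parametricEligible ? parametric : fallback;
}

console.log(selectTest("paired", false).test);
```

Designs that are not keys of this lookup are deliberately unrepresentable, which mirrors the instruction to stop rather than coerce.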
Always report effect sizes alongside p-values. A p-value addresses compatibility with a null model under the chosen test. It does not tell you how large the effect is. It does not tell you whether the effect matters. Use these defaults unless the design demands a better-matched magnitude metric:

- Does the effect matter relative to `baselineRef`?
- Does it matter given `costCap` and current spend from `getHypothesisSpend(...)`?

If the honest answer is "statistically yes, practically maybe not," say exactly that. Whenever the
toolchain allows it, report confidence intervals around the effect size or mean difference.
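
For the two-independent-groups row, Cohen's d with a pooled standard deviation can be computed as in the sketch below. It assumes raw per-unit scores are available as plain arrays; `cohensD` and its helpers are illustrative names, the numbers are made up, and a paired design would need the paired-differences version instead.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Cohen's d for two independent groups: mean difference over the pooled SD.
function cohensD(a: number[], b: number[]): number {
  if (a.length < 2 || b.length < 2) {
    throw new RangeError("each group needs at least 2 observations");
  }
  const pooledVariance =
    ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) /
    (a.length + b.length - 2);
  if (pooledVariance === 0) {
    throw new RangeError("pooled variance is zero; d is undefined");
  }
  return (mean(a) - mean(b)) / Math.sqrt(pooledVariance);
}

// Made-up scores purely to exercise the function.
const treatment = [2, 4, 6];
const control = [1, 3, 5];
console.log(`d = ${cohensD(treatment, control).toFixed(2)}`);
```

With three observations per group this number would be far too unstable to interpret; the example only shows the arithmetic.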
Never call `await updateHypothesisStatus(cwd, id, "CONFIRMED");`
just because p crossed a threshold. This phase only decides whether inferential language is
statistically supportable.
If you test enough things, one of them will flatter you. Multiplicity correction exists because researchers are predictable. This phase is mandatory when you ran many tests across:
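
One widely used correction is Holm's step-down procedure, sketched below. The source does not say which correction this repo expects, so the choice of Holm is an assumption, and `holmAdjust` is a hypothetical helper. The input must contain every p-value in the family, including the comparisons that did not look promising.

```typescript
// Holm step-down adjusted p-values, returned in the same order as the input.
function holmAdjust(pValues: number[]): number[] {
  const m = pValues.length;
  // Remember each p-value's original position, then sort ascending.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((x, y) => x.p - y.p);

  const adjusted: number[] = pValues.slice();
  let runningMax = 0;
  order.forEach((entry, rank) => {
    // The smallest p is multiplied by m, the next by m - 1, and so on, capped at 1.
    const scaled = Math.min(1, (m - rank) * entry.p);
    // Enforce monotonicity: an adjusted p never drops below an earlier one.
    runningMax = Math.max(runningMax, scaled);
    adjusted[entry.index] = runningMax;
  });
  return adjusted;
}

// Made-up family of three tests.
console.log(holmAdjust([0.01, 0.04, 0.03]));
```

The family size `m` is the number of tests that were run, not the number that were reported.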
This phase turns arithmetic into repo language. Write so another agent can audit the sentence line by line. Use APA-style reporting. The base pattern is:
- t(df)=X.XX, p=.XXX, d=X.XX
- F(df1, df2)=X.XX, p=.XXX, η²=X.XX
- r(df)=.XX, p=.XXX, R²=0.XX
Always report exact p-values. Do not write p < .05. Do not round p down to zero. If software
gives 0.0004, report p=.0004.
Use leading zeros for values that can exceed 1:

- d = 0.74
- η² = 0.18
- R² = 0.27
- M = 1.42
- SD = 0.91

Do not write d = .74 in this repo. Careless reporting breeds careless review.
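
The reporting rules above can be captured in a small formatter. This is a sketch, not a repo utility: `formatP` and `formatTTest` are hypothetical names, and the inputs in the example are made up. It reports p exactly as supplied, without the leading zero, and refuses a p of zero instead of printing it.

```typescript
// Exact p-value without a leading zero, e.g. 0.0497 -> ".0497".
// Values below 1e-6 stringify in exponential notation in JavaScript and are left that way.
function formatP(p: number): string {
  if (!(p > 0 && p <= 1)) {
    throw new RangeError("expected 0 < p <= 1; a p of exactly 0 is a rounding artifact");
  }
  const text = String(p);
  return text.indexOf("0.") === 0 ? text.slice(1) : text;
}

// t-test sentence in the base pattern: t(df)=X.XX, p=.XXX, d=X.XX
function formatTTest(df: number, t: number, p: number, d: number): string {
  // toFixed keeps the leading zero on d, as this repo requires.
  return `t(${df})=${t.toFixed(2)}, p=${formatP(p)}, d=${d.toFixed(2)}`;
}

console.log(formatTTest(38, 2.5, 0.0497, 0.74));
```

Passing the p-value through unrounded is a deliberate choice here: rounding 0.0497 to three decimals would print `.050` and change which side of the threshold the reader sees.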
Always include:

Do not write:

- `ns`
- `trend`
- `marginally significant`
- `approached significance`
- `p < .05`
- `basically significant`

Write the statistical sentence before the headline sentence:

Keep the analysis note in `experiments/{id}/smokes/` until the claim survives. A
good local note includes the hypothesis id, exact claim, unit of analysis, sample sizes, assumption
checks, chosen test and why, exact test output, effect size, multiplicity family size and
correction, final APA-form sentence, and anything that blocked promotion.

| Failure | What actually happened | Correct response |
|---|---|---|
| Chose a t-test by habit | Test selection came from field custom, not design | Re-run Phase 2 from the actual design |
| Counted repeated prompts as independent rows | Inflated n through pseudo-replication | Rebuild the analysis at the correct unit |
| Used Shapiro-Wilk at large n and panicked at a tiny deviation | Mistook detectability for materiality | Use Q-Q plots for n ≥ 50 |
| Ignored Levene's test because the means looked clean | Treated variance as cosmetic | Use the variance-robust branch or a different test |
| Ran pairwise tests after ANOVA and forgot multiplicity | Manufactured significance by repetition | Count the full family and correct it |
| Reported only p | Hid magnitude | Add the matched effect size |
| Reported only the effect size | Hid uncertainty about distinguishability | Add the test statistic and exact p-value |
| Used p < .05 | Replaced evidence with a threshold slogan | Report the exact p-value |
| Declared "no effect" from a non-significant pilot | Confused absence of evidence with evidence of absence | Report uncertainty honestly |
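
```typescript
// Assumes loadHypotheses and getActiveHypothesis come from src/state/repo.ts
// and that cwd is the repository root.
const entries = await loadHypotheses(cwd);
const h = getActiveHypothesis(entries);
if (!h) throw new Error("No OPEN or RUNNING hypothesis.");
```
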
Good because statistical review begins from the registered hypothesis, not from the newest folder or your memory.

```typescript
// Anti-pattern: do not promote the hypothesis on a p-value alone.
await updateHypothesisStatus(cwd, h.id, "CONFIRMED");
```

Bad because confirmation requires more than one threshold crossing.
At the end of this phase, make one decision only:

- Proceed to `/skill:falsification-review`
- Stay in `smokes/`

Before a result leaves `smokes/`, every item below must be true:

- Shapiro-Wilk for n < 50 or Q-Q plots for n ≥ 50
- Not marked `CONFIRMED` based on statistics alone

| Excuse | Reality |
|---|---|
| We have n=5 but the effect is huge. | Huge-looking effects at tiny n are exactly where sampling error does its best work. |
| Everyone in my field just uses t-tests. | Field habit does not repair violated assumptions or mismatched designs. |
| Correcting for multiple comparisons makes nothing significant. | Then nothing survived multiplicity control. That is the result. |
| Shapiro-Wilk did not reject, so normality is proven. | Failure to reject is not proof; it is one small piece of evidence. |
| At n=200, normality always fails, so diagnostics are pointless. | Large n changes which diagnostic is informative; it does not abolish diagnostics. |
| The same users appeared in both groups, but the prompts differed, so it is still independent. | Shared units make the design paired or clustered. |
| The p-value is .0497, so I can just say p<.05. | Threshold slogans throw away the actual evidence. Report p=.0497. |
| The effect size is small, but statistically significant is statistically significant. | Statistical and practical significance are different questions. |
| The corrected p-value only matters for the appendix. | If correction governs the claim, it belongs in the main sentence. |
| The metric is capped, but everybody scores near the top, so that is good news. | Ceiling effects can erase the signal you think you are measuring. |
| The groups look separated in the plot, so the test is obvious. | Visual separation can be real, noisy, or misleading. Use the matched test. |
| We only ran extra slices for intuition. | If the slices could have produced a win sentence, they count in the family. |
| Non-parametric tests are weaker, so I should stay parametric. | "Weaker" is not an excuse to violate assumptions. |
| It is just an internal result. | Internal falsehood ages into external folklore fast. |

Stop immediately if any of these thoughts show up:

- The histogram looks normal enough.
- The variance difference is probably fine.
- I can count each slice as another sample.
- I'll try both tests and report the cleaner one.
- Pairing makes the analysis annoying.
- The post-hoc comparisons are obvious, so they do not count.
- The effect size is optional because the p-value is strong.
- The p-value is close enough to .05.
- The correction kills the story.
- No one will ask how many tests we ran.
- I can call this confirmed now.
- The ceiling effect is actually a success effect.

Any one of these means the claim is still negotiating with the gate. Do not negotiate. Repair the
analysis.

Most statistical failures do not come from exotic mathematics. They come from ordinary shortcuts:
using the wrong test, ignoring assumptions, treating paired data as independent, hiding effect size,
forgetting how many shots at significance were taken, and reporting threshold slogans instead of the
actual evidence.
This phase matters because it keeps smokes/ provisional until inferential language is earned,
blocks field customs from outranking the actual data structure, forces magnitude and uncertainty to
travel together, and prevents uncorrected p-hacking from becoming repo memory.
A result can fail here and still be valuable. It may teach you that the effect is too small, the
metric is saturated, the sample is underpowered, or the claim needs narrowing. That is not wasted
work. That is the point of rigor.
After this, use /skill:falsification-review.
npx claudepluginhub atomicstrata/epistemic --plugin epistemic-skills