Skill

Hypothesis Testing

Statistical hypothesis testing framework including null and alternative hypotheses, error types, p-value interpretation, confidence intervals, parametric and non-parametric test selection, and one-sided vs two-sided testing.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsyniq-lite:hypothesis-testing

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

The null hypothesis states there is no effect, no difference, or no association. It is the default assumption that the treatment or exposure has no impact.

SKILL.md

204 lines · ~2.6k tokens

Stats

Stars0

MaintenanceGood

Last CommitMar 15, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Hypothesis Testing

Conceptual Framework

Null Hypothesis (H0)

The null hypothesis states there is no effect, no difference, or no association. It is the default assumption that the treatment or exposure has no impact.

H0 for difference in means: mu_1 - mu_2 = 0
H0 for odds ratio: OR = 1
H0 for correlation: rho = 0

The null hypothesis is never "proven" — it is either rejected or not rejected. "Failure to reject" is not the same as "accepting" H0.

Alternative Hypothesis (H1 or Ha)

The alternative hypothesis states that an effect exists. It is what the researcher seeks evidence for.

Two-sided: mu_1 ≠ mu_2 (effect in either direction)
One-sided: mu_1 > mu_2 (effect in a specific direction)

The alternative hypothesis should be specified before data collection.

Error Types

Type I Error (Alpha, False Positive)

Rejecting H0 when it is actually true
Concluding a treatment works when it does not
Controlled by setting the significance level (conventionally alpha = 0.05)
The significance level is the maximum acceptable probability of Type I error

Type II Error (Beta, False Negative)

Failing to reject H0 when it is actually false
Missing a real treatment effect
Related to statistical power: Power = 1 - beta
Conventional beta = 0.20 (power = 80%) or beta = 0.10 (power = 90%)

Relationship Between Errors

For a fixed sample size, reducing alpha increases beta (and vice versa)
Increasing sample size reduces both errors simultaneously
The only free lunch: larger sample sizes improve both Type I and Type II error control

Type III Error (Directional Error)

Correctly rejecting H0 but concluding the effect is in the wrong direction. Relevant primarily in one-sided testing contexts.

P-Value Interpretation

What a P-Value IS

The p-value is the probability of observing data at least as extreme as the actual data, assuming the null hypothesis is true.

P(data this extreme or more | H0 is true)

What a P-Value Is NOT

NOT the probability that H0 is true. P(H0 | data) ≠ p-value. This confusion (the "prosecutor's fallacy" or "transposed conditional") is the single most common misinterpretation in medical research.
NOT the probability that the result is due to chance. The p-value assumes chance (H0) is true and asks about the data; it does not assign a probability to chance being the explanation.
NOT the probability of making an error. The p-value is a property of the data, not of any decision.
NOT a measure of effect size or clinical importance. A tiny, clinically meaningless effect can produce a very small p-value with a large sample.
NOT a measure of evidence strength in an absolute sense. A p-value of 0.04 is not dramatically more "significant" than 0.06.

ASA Statement on P-Values (2016)

Six principles:

P-values indicate how incompatible data are with a specified statistical model
P-values do not measure the probability that the studied hypothesis is true
Scientific conclusions should not be based only on whether a p-value passes a specific threshold
Proper reporting requires full transparency
A p-value does not measure the size or importance of an effect
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis

The 0.05 Threshold

Arbitrary convention (Fisher suggested it as a convenient reference point; Neyman-Pearson formalized it as a decision threshold)
"Statistically significant" does not mean "clinically significant"
Results just above or below 0.05 should be interpreted similarly, not as qualitatively different
Some fields have moved to stricter thresholds (particle physics: 5-sigma; genomics: 5x10^-8)
Consider reporting exact p-values rather than threshold declarations

Confidence Intervals

Interpretation

A 95% confidence interval means: if we repeated the study infinitely and calculated a CI each time, 95% of those intervals would contain the true parameter value.

It does NOT mean: there is a 95% probability that the true value lies within this specific interval (the true value either is or is not in the interval; we do not know which).

Advantages Over P-Values

Provide information about effect magnitude and precision
Width reflects sample size and variability (narrow CI = precise estimate)
CI exclusion of the null value is equivalent to a significant p-value at the corresponding alpha
Can assess both statistical and clinical significance: does the CI include clinically meaningful values?

Relationship to Hypothesis Testing

A 95% CI excluding zero (for differences) or one (for ratios) corresponds to p < 0.05 (two-sided)
The CI provides strictly more information than the p-value
CI can show a "statistically significant" result that is clinically trivial (entire CI within clinically negligible range)
CI can show a "non-significant" result that is clinically compatible with an important effect (wide CI spanning both trivial and important values — an inconclusive study, not a negative study)

Test Selection Framework

Decision Flowchart

Step 1: What is the research question?

Comparison of groups? Association between variables? Agreement?

Step 2: How many groups?

Two groups or more than two?

Step 3: What type of data?

Continuous (interval/ratio)
Ordinal
Categorical (nominal, binary)
Time-to-event

Step 4: Are assumptions met?

Normality (Shapiro-Wilk test, Q-Q plots, histogram inspection)
Equal variances (Levene's test, F-test)
Independence of observations
Adequate sample size

Parametric Tests (Assume Normal Distribution)

Scenario	Test
One sample vs known value	One-sample t-test
Two independent groups, continuous	Independent samples t-test (Welch's preferred)
Two paired/matched groups	Paired t-test
>2 independent groups	One-way ANOVA
>2 related groups	Repeated measures ANOVA
Two factors	Two-way ANOVA
Continuous predictor/outcome	Pearson correlation, linear regression

Non-Parametric Tests (No Distribution Assumption)

Scenario	Test
One sample vs known value	Wilcoxon signed-rank
Two independent groups	Mann-Whitney U (Wilcoxon rank-sum)
Two paired groups	Wilcoxon signed-rank
>2 independent groups	Kruskal-Wallis
>2 related groups	Friedman test
Correlation	Spearman rank correlation
Two categorical variables	Chi-square, Fisher's exact
Ordered categorical outcome	Cochran-Armitage trend test

When to Use Non-Parametric Tests

Clearly non-normal data (heavy skew, outliers, bounded scales)
Small samples where normality cannot be assessed
Ordinal data (Likert scales — debated, but non-parametric is conservative)
Data with floor/ceiling effects
When the median is a more appropriate summary than the mean

When Parametric Tests Are Robust

t-tests and ANOVA are robust to moderate non-normality when sample sizes are adequate (n > 30 per group, Central Limit Theorem)
Welch's t-test (default) is robust to unequal variances
Log transformation can normalize right-skewed data (common in biology/medicine)

One-Sided vs Two-Sided Testing

Two-Sided (Default)

Tests for a difference in either direction. Used when an effect in either direction is possible and scientifically relevant.

Standard in clinical trials (regulatory expectation)
Alpha split between both tails (0.025 each for alpha = 0.05)
Conservative — requires stronger evidence than one-sided

One-Sided

Tests for a difference in one direction only.

Appropriate only when an effect in the opposite direction is impossible or would be interpreted the same as no effect
Uses full alpha in one tail — easier to achieve "significance"
Requires strong a priori justification
Regulatory agencies generally require two-sided tests

When One-Sided Is Defensible

Non-inferiority trials (inherently one-sided — testing that new treatment is not worse)
Equivalence testing (two one-sided tests — TOST)
Vaccine trials where the question is "does it protect?" (harm from vaccine in terms of the disease endpoint is not a plausible scientific question)
Safety monitoring where only harm is of concern

Common Misuse

Using one-sided tests to convert a "non-significant" two-sided p-value (e.g., 0.07) into a "significant" one-sided result (0.035). This is alpha manipulation and scientifically dishonest if not pre-specified.

Multiple Testing

When multiple hypotheses are tested simultaneously, the probability of at least one Type I error exceeds alpha. With k independent tests at alpha = 0.05:

P(at least one false positive) = 1 - (1 - 0.05)^k

For 20 tests: P = 1 - 0.95^20 = 0.64 (64% chance of at least one false positive).

See the dedicated multiple-comparisons skill for correction methods.

Practical Recommendations

Pre-specify the hypothesis, primary outcome, and analysis method before data collection.
Report exact p-values (p = 0.032, not "p < 0.05"). For very small values, report p < 0.001.
Always report confidence intervals alongside p-values.
Always report effect sizes — they convey clinical meaning that p-values cannot.
Do not equate statistical significance with clinical importance. A large trial can find a statistically significant but clinically trivial effect.
Do not equate non-significance with "no effect." A small trial may miss a real effect (Type II error). Report the CI — if it includes clinically important values, the study is inconclusive, not negative.
Consider the prior probability of the hypothesis. A p = 0.04 for a biologically plausible hypothesis in a well-designed trial is more convincing than p = 0.04 from an implausible exploratory analysis.
Avoid dichotomous thinking. A result with p = 0.049 and one with p = 0.051 are essentially identical in evidential strength.

Hypothesis Testing

Invocation

Context Preview

SKILL.md

Hypothesis Testing

Invocation

Context Preview

SKILL.md

Hypothesis Testing

Conceptual Framework

Null Hypothesis (H0)

Alternative Hypothesis (H1 or Ha)

Error Types

Type I Error (Alpha, False Positive)

Type II Error (Beta, False Negative)

Relationship Between Errors

Type III Error (Directional Error)

P-Value Interpretation

What a P-Value IS

What a P-Value Is NOT

ASA Statement on P-Values (2016)

The 0.05 Threshold

Confidence Intervals

Interpretation

Advantages Over P-Values

Relationship to Hypothesis Testing

Test Selection Framework

Decision Flowchart

Parametric Tests (Assume Normal Distribution)

Non-Parametric Tests (No Distribution Assumption)

When to Use Non-Parametric Tests

When Parametric Tests Are Robust

One-Sided vs Two-Sided Testing

Two-Sided (Default)

One-Sided

When One-Sided Is Defensible

Common Misuse

Multiple Testing

Practical Recommendations

Similar Skills

Hypothesis Testing

Conceptual Framework

Null Hypothesis (H0)

Alternative Hypothesis (H1 or Ha)

Error Types

Type I Error (Alpha, False Positive)

Type II Error (Beta, False Negative)

Relationship Between Errors

Type III Error (Directional Error)

P-Value Interpretation

What a P-Value IS

What a P-Value Is NOT

ASA Statement on P-Values (2016)

The 0.05 Threshold

Confidence Intervals

Interpretation

Advantages Over P-Values

Relationship to Hypothesis Testing

Test Selection Framework

Decision Flowchart

Parametric Tests (Assume Normal Distribution)

Non-Parametric Tests (No Distribution Assumption)

When to Use Non-Parametric Tests

When Parametric Tests Are Robust

One-Sided vs Two-Sided Testing

Two-Sided (Default)

One-Sided

When One-Sided Is Defensible

Common Misuse

Multiple Testing

Practical Recommendations

Similar Skills