Criterion-related validation
Evidence that scores on a selection procedure are statistically related to one or more measures
of work-relevant behavior or outcomes. The strongest form of "does it predict?" evidence — but
also the most demanding on sample size, criterion quality, and analysis.
Feasibility first
Three things determine whether a criterion-related study is even sensible:
- A relevant, reliable, uncontaminated criterion is available or buildable. Relevance is the
most important property.
- A representative research sample of the workforce/candidate pool exists.
- Adequate statistical power is attainable. Estimate the needed sample size before the study
from the expected effect size, the statistic, the chosen alpha, and the target power. Range
restriction and criterion unreliability attenuate the observed coefficient; correcting for them
raises the estimate but also inflates its standard error, so distinguish observed from corrected
coefficients when judging power. Underpowered studies are a chronic failure mode; give Type II
error as much attention as Type I.
If any of these is badly deficient, a criterion-related strategy may not be feasible — consider
content-based-validation or generalizing-validity-evidence.
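As a rough planning aid for the power question above, the sketch below approximates the sample size needed to detect a correlation using the Fisher z transformation. The function name and the example effect sizes are illustrative assumptions, not values from the Principles; the approximation assumes a simple bivariate correlation, independent observations, and approximate normality.

```python
import math
from statistics import NormalDist


def n_for_correlation(expected_r, alpha=0.05, power=0.80, two_sided=True):
    """Approximate N needed to detect a correlation of size expected_r.

    Uses the Fisher z approximation: N = ((z_alpha + z_power) / atanh(r))^2 + 3.
    """
    if not 0 < abs(expected_r) < 1:
        raise ValueError("expected_r must be nonzero and strictly between -1 and 1")
    z = NormalDist()
    # Critical value for the chosen alpha (one- or two-sided test).
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    # Normal quantile corresponding to the target power.
    z_power = z.inv_cdf(power)
    # Fisher z of the expected correlation is the effect on the z scale.
    effect = math.atanh(abs(expected_r))
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


# Hypothetical planning values: plug in the correlation you expect to OBSERVE
# (i.e., after range restriction and criterion unreliability have attenuated it).
for r in (0.15, 0.20, 0.30):
    print(r, n_for_correlation(r))
```

Because the study will test the observed coefficient, the expected effect entered here should be the attenuated value, which is usually smaller than the operational validity reported in the literature and therefore demands a larger sample.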
Design: predictive vs. concurrent
- Predictive — predictor collected at/around selection; criterion collected later (after a
retention interval or once performance stabilizes).
- Concurrent — predictor and criterion collected on incumbents at about the same time.
For stable cognitive abilities, predictive and concurrent estimates tend to be comparable. For
noncognitive self-reports (personality, interests, SJTs) and experience-based measures
(biodata), the designs can diverge — e.g., applicant faking motivation differs from incumbents';
biodata responses may reflect on-the-job experience. So results don't automatically generalize
across designs or predictor types. Match the design's inference to your use.
Other design choices that matter: the basis on which sample members were selected, the population
sampled (applicants vs. recent hires vs. fully experienced), and whether you're predicting
higher-level work (acceptable if a substantial share of those hired advance and you use criteria
at both the hire level and the higher level).
Criterion development
Choose criteria for relevance, freedom from contamination, and reliability — not availability
or convenience. Criteria should represent important organizational, team, or individual outcomes.
- Relevance — reflects standing on an outcome critical to success (job performance, training
success, turnover, OCBs, advancement). Need not be all-inclusive, but the link to the proposed
use must be clear and rationale documented.
- Contamination — systematic variance unrelated to the construct (machinery quality, sales
territory, rater knowledge of predictor scores, shift, location, rater attitudes). Minimize via
standardized administration; measure and statistically control contaminants where possible.
- Deficiency — excludes relevant variance (a "performance" criterion missing behaviors critical
to performance). If the criterion can't cover the full domain, state what is omitted and the
implication for the inference.
- Criterion bias — systematic error from contamination/deficiency that differentially affects
subgroups. Cannot be detected from criterion scores alone; anticipate and guard against it with
professional judgment.
- Reliability — identify the conditions of measurement (raters, items, occasions) you want to
generalize across and design to estimate the matching reliability. Internal-consistency estimates
may be inadequate for ratings (they ignore rater and time variance).
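To illustrate how the number of raters changes the reliability of a ratings criterion, here is a minimal sketch using the Spearman-Brown formula. It assumes raters are interchangeable (parallel) and it addresses rater variance only, not variance across occasions; the function name and the input values are hypothetical.

```python
def spearman_brown(single_rater_reliability, k_raters):
    """Reliability of the mean of k parallel raters, given single-rater reliability."""
    r = single_rater_reliability
    if not 0 <= r <= 1:
        raise ValueError("single_rater_reliability must be between 0 and 1")
    if k_raters <= 0:
        raise ValueError("k_raters must be positive")
    return k_raters * r / (1 + (k_raters - 1) * r)


# Hypothetical single-rater interrater reliability of 0.50.
for k in (1, 2, 3):
    print(k, round(spearman_brown(0.50, k), 3))
```

If the operational criterion will be one supervisor's rating, the single-rater value is the relevant one; if ratings will be averaged across raters, the stepped-up value applies.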
Common criterion types: supervisory performance ratings (most common; ratings collected for
research are preferable to administrative ratings — Jawahar & Williams, 1997), other performance
indices, archival/HRIS data (verify accuracy, alignment, consistency, and data-privacy
compliance before use).
Choice of predictor
- Predictors need a theoretical, logical, or empirical foundation; specify the rationale before
the study. Verify serendipitous findings (especially from small samples) by independent
replication.
- Base preliminary choices on the work analysis and scientific knowledge, not personal
familiarity or bias.
- Keep the construct vs. method distinction explicit (e.g., "the interview" is a method, not a
construct) to avoid uninterpretable comparisons.
- Address predictor contamination (unstandardized administration, irrelevant content, cheating,
faking — unstructured interviews and unproctored internet tests are higher risk) and predictor
reliability (estimate it for the conditions of intended use).
- When algorithms score structured/unstructured inputs (text, resumes, stimuli responses),
document the conceptual/methodological basis, provide cross-validation evidence, and ensure the
algorithm does not introduce systematic bias against subgroups.
Choice of participants
- The validation sample should represent the situation you'll generalize to (demographics,
motivation, ability, experience). Convenience samples are discouraged to the extent they're
unrepresentative.
- You generally cannot validate separately for every subgroup. Only test for bias when there is
credible evidence of potential bias and sufficient data (adequate power and precision) for the
proper analysis (see fairness-and-bias-analysis). A subgroup too small to analyze cannot be
compared until more data exist.
Data analysis
- Ensure analyses fit the data (level of measurement, sample size, non-independence/clustering).
Don't pick a method because the software is handy; if you delegate analysis, you retain
responsibility for its suitability and accuracy.
- Report effect sizes, statistical significance, and standard errors / confidence intervals for
predictor–criterion relationships; describe distributions (central tendency, variance) and
interrelationships.
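One way to attach a confidence interval to an observed predictor-criterion correlation is the Fisher z interval sketched below. This is an illustrative helper (the name `correlation_ci` and the example numbers are assumptions), suitable only for uncorrected coefficients from independent observations under approximate bivariate normality; clustered data or corrected coefficients need other procedures.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Fisher z confidence interval for an observed (uncorrected) correlation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    center = math.atanh(r)          # transform r to the z scale
    se = 1 / math.sqrt(n - 3)       # standard error on the z scale
    # Build the interval on the z scale, then back-transform to the r scale.
    return math.tanh(center - z_crit * se), math.tanh(center + z_crit * se)


# Hypothetical study result: observed r = 0.30 with N = 100.
low, high = correlation_ci(0.30, 100)
print(round(low, 3), round(high, 3))
```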
Adjustments (corrections) to validity estimates
- Aim for an unbiased estimate of operational validity. Observed coefficients misestimate the
population value because of range restriction and criterion unreliability; apply suitable
bivariate/multivariate corrections when an appropriate estimate is available.
- Do not correct for predictor unreliability when estimating operational validity: selection
decisions are made with the actual, imperfectly reliable predictor.
- For ratings criteria, internal-consistency reliability may be inadequate — account for rater and
time variance.
- Corrected coefficients are point estimates: usual significance tests/SEs for unadjusted
coefficients don't apply. Report both corrected and uncorrected values, and use procedures
built for corrected coefficients when testing significance / forming CIs. State explicitly when a
coefficient is theoretical and not the operational validity.
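The two most common bivariate adjustments can be sketched as follows. This is a hedged illustration, not the prescribed procedure: the range-restriction function uses the Thorndike Case II formula, which assumes direct restriction on the predictor and a linear, homoscedastic relationship; indirect or multivariate restriction calls for different formulas. Function names and example inputs are hypothetical.

```python
import math


def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction on the predictor."""
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted  # ratio of applicant SD to sample SD
    return r_restricted * u / math.sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))


def correct_criterion_unreliability(r, criterion_reliability):
    """Disattenuate for criterion unreliability only (never the predictor)."""
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r / math.sqrt(criterion_reliability)


# Hypothetical inputs; always report the observed value alongside any corrected one.
observed_r = 0.25
print("observed:", observed_r)
print("range-restriction corrected:",
      round(correct_range_restriction(observed_r, 1.5, 1.0), 3))
print("criterion-unreliability corrected:",
      round(correct_criterion_unreliability(observed_r, 0.64), 3))
```

When both corrections are applied, the order and the population in which the reliability estimate was obtained affect the result, so document the sequence used and the source of each artifact estimate.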
Combining predictors and criteria
- Document the method of combination: multiple regression, Pareto-optimal weights, unit weights,
rational weights, etc. Effective weights ≠ nominal weights (they depend on variances/covariances),
especially when predictors are differentially range-restricted.
- Compensatory vs. noncompensatory combination (and cutoffs/sequencing) affects rank order,
expected mean criterion standing, and subgroup differences (see
selection-decisions-and-scoring).
- Cross-validate composites built from regression (or any data-driven weighting/keying) to guard
against capitalization on chance, especially in small samples. Unit and rational weights are
fixed in advance and not estimated from the sample, so they are not subject to this shrinkage.
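Two of the points above lend themselves to a small numeric sketch. The first function shows why effective weights differ from nominal weights: it computes each component's share of composite variance from the covariance matrix. The second applies the standard adjusted R-squared formula as a quick, formula-based indication of shrinkage; it estimates the population squared multiple correlation and is offered here only as a stand-in, not a substitute for cross-validation on independent data. Names and inputs are hypothetical.

```python
def effective_weights(nominal_weights, cov):
    """Each component's share of composite variance (shares sum to 1)."""
    k = len(nominal_weights)
    # Covariance of each component with the weighted composite.
    cov_with_composite = [
        sum(nominal_weights[j] * cov[i][j] for j in range(k)) for i in range(k)
    ]
    composite_var = sum(
        nominal_weights[i] * cov_with_composite[i] for i in range(k)
    )
    return [
        nominal_weights[i] * cov_with_composite[i] / composite_var
        for i in range(k)
    ]


def adjusted_r_squared(r_squared, n, k):
    """Standard adjusted R^2 for a regression with k predictors and n cases."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical example: unit weights, but predictor 1 has SD 2 and predictor 2
# has SD 1 (correlation 0.5), so predictor 1 dominates the composite.
cov = [[4.0, 1.0],
       [1.0, 1.0]]
print([round(s, 3) for s in effective_weights([1.0, 1.0], cov)])

# Hypothetical regression: sample R^2 = 0.25 from 60 cases and 5 predictors.
print(round(adjusted_r_squared(0.25, 60, 5), 3))
```

Standardizing predictors before applying nominal weights brings the effective weights closer to the intended ones, although correlations among predictors still influence them.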
Interpreting results
Interpret against the cumulative research literature. Unusual findings (suppressors, moderators,
nonlinearity, configural scoring, differential weighting of highly correlated predictors) are
suspect — require a very large sample or replication before acting on them.
Pitfalls
- Choosing an available criterion over a relevant one.
- Ignoring power until after data collection.
- Reporting only corrected validities, or applying naive significance tests to them.
- Correcting for predictor unreliability and treating it as operational validity.
- Capitalizing on chance via regression/empirical keying without cross-validation.
- Generalizing concurrent results to a predictive (applicant) use for fakable measures.
Checklist
See also
work-analysis · validation-planning · generalizing-validity-evidence ·
fairness-and-bias-analysis · selection-decisions-and-scoring · technical-validation-report
Source: Principles (5th ed., 2018), "Sources of Validity Evidence → Criterion-Related Evidence"
and "Operational Considerations → Selecting Criterion Measures / Data Analyses."