Skill

criterion-related-validation

Use when designing, conducting, or evaluating a criterion-related validity study: demonstrating an empirical relationship between selection-procedure (predictor) scores and work-relevant criteria. Covers predictive vs. concurrent designs, criterion development (relevance, contamination, deficiency, reliability, bias), predictor choice, participant sampling, statistical power, data analysis, corrections for range restriction and unreliability, and combining predictors and criteria. Triggers: "criterion validity", "predictive/concurrent study", "validity coefficient", "correct for range restriction", "criterion measure", "is the test related to performance".

Slash command

/personnel-selection:criterion-related-validation

SKILL.md

172 lines · ~2.6k tokens

Stats

Parent stars: 1

Maintenance: Good

Last commit: May 31, 2026

Criterion-related validation

Evidence that scores on a selection procedure are statistically related to one or more measures of work-relevant behavior or outcomes. The strongest form of "does it predict?" evidence — but also the most demanding on sample size, criterion quality, and analysis.

Feasibility first

Three things determine whether a criterion-related study is even sensible:

A relevant, reliable, uncontaminated criterion is available or buildable. Relevance is the most important property.
A representative research sample of the workforce/candidate pool exists.
Adequate statistical power is attainable. Estimate the needed sample size before the study from the expected effect size, the statistic, and the chosen alpha. Corrections for range restriction and criterion unreliability raise the corrected coefficient but also inflate its standard error, so distinguish observed from corrected coefficients when judging power. Underpowered studies are a chronic failure mode; give Type II error as much attention as Type I.

If any of these is badly deficient, a criterion-related strategy may not be feasible — consider content-based-validation or generalizing-validity-evidence.
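The sample-size step above can be sketched as follows. This is a minimal illustration, assuming the statistic is a Pearson correlation and using the standard Fisher z approximation; the function name `required_n_for_correlation` and its defaults (alpha = .05, power = .80, two-sided) are hypothetical choices for this example, not part of the skill. Plan on the expected observed (uncorrected) coefficient, since that is what the significance test is run on.

```python
import math
from statistics import NormalDist


def required_n_for_correlation(expected_r, alpha=0.05, power=0.80, two_sided=True):
    """Approximate N needed to detect a population correlation of expected_r.

    Uses the Fisher z approximation: N = ((z_alpha + z_beta) / atanh(r))^2 + 3.
    expected_r should be the anticipated OBSERVED correlation (an assumption
    the analyst must justify from prior research).
    """
    if not 0 < abs(expected_r) < 1:
        raise ValueError("expected_r must be nonzero and strictly between -1 and 1")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must be strictly between 0 and 1")

    normal = NormalDist()
    # Critical value for the chosen Type I error rate.
    z_alpha = normal.inv_cdf(1 - alpha / 2) if two_sided else normal.inv_cdf(1 - alpha)
    # Normal quantile corresponding to the desired power (1 - Type II error rate).
    z_beta = normal.inv_cdf(power)
    # Fisher z transform of the expected effect size.
    fisher_z = math.atanh(abs(expected_r))

    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


if __name__ == "__main__":
    for r in (0.10, 0.20, 0.30):
        print(f"expected r = {r:.2f}: N of about {required_n_for_correlation(r)}")
```

Smaller expected correlations drive the required N up quickly, which is why the estimate belongs before data collection, not after.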

Design: predictive vs. concurrent

Predictive — predictor collected at/around selection; criterion collected later (after a retention interval or once performance stabilizes).
Concurrent — predictor and criterion collected on incumbents at about the same time.

For stable cognitive abilities, predictive and concurrent estimates tend to be comparable. For noncognitive self-reports (personality, interests, SJTs) and experience-based measures (biodata), the designs can diverge — e.g., applicant faking motivation differs from incumbents'; biodata responses may reflect on-the-job experience. So results don't automatically generalize across designs or predictor types. Match the design's inference to your use.

Other design choices that matter: the basis on which sample members were selected, the population sampled (applicants vs. recent hires vs. fully experienced incumbents), and whether you're predicting performance in higher-level work than the entry job (acceptable if a substantial share of hires advance to that level and you use criteria at both the hire level and the higher level).

Criterion development

Choose criteria for relevance, freedom from contamination, and reliability — not availability or convenience. Criteria should represent important organizational, team, or individual outcomes.

Relevance — reflects standing on an outcome critical to success (job performance, training success, turnover, OCBs, advancement). Need not be all-inclusive, but the link to the proposed use must be clear and rationale documented.
Contamination — systematic variance unrelated to the construct (machinery quality, sales territory, rater knowledge of predictor scores, shift, location, rater attitudes). Minimize via standardized administration; measure and statistically control contaminants where possible.
Deficiency — excludes relevant variance (a "performance" criterion missing behaviors critical to performance). If the criterion can't cover the full domain, state what is omitted and the implication for the inference.
Criterion bias — systematic error from contamination/deficiency that differentially affects subgroups. Cannot be detected from criterion scores alone; anticipate and guard against it with professional judgment.
Reliability — identify the conditions of measurement (raters, items, occasions) you want to generalize across and design to estimate the matching reliability. Internal-consistency estimates may be inadequate for ratings (they ignore rater and time variance).

Common criterion types: supervisory performance ratings (most common; ratings collected for research are preferable to administrative ratings — Jawahar & Williams, 1997), other performance indices, archival/HRIS data (verify accuracy, alignment, consistency, and data-privacy compliance before use).

Choice of predictor

Predictors need a theoretical, logical, or empirical foundation; specify the rationale before the study. Verify serendipitous findings (especially from small samples) by independent replication.
Base preliminary choices on the work analysis and scientific knowledge, not personal familiarity or bias.
Keep the construct vs. method distinction explicit (e.g., "the interview" is a method, not a construct) to avoid uninterpretable comparisons.
Address predictor contamination (unstandardized administration, irrelevant content, cheating, faking — unstructured interviews and unproctored internet tests are higher risk) and predictor reliability (estimate it for the conditions of intended use).
When algorithms score structured/unstructured inputs (text, resumes, stimuli responses), document the conceptual/methodological basis, provide cross-validation evidence, and ensure the algorithm does not introduce systematic bias against subgroups.

Choice of participants

The validation sample should represent the situation you'll generalize to (demographics, motivation, ability, experience). Convenience samples are discouraged to the extent they're unrepresentative.
You generally cannot validate separately for every subgroup. Only test for bias when there is credible evidence of potential bias and sufficient data (adequate power and precision) for the proper analysis — see fairness-and-bias-analysis. A subgroup too small to analyze cannot be compared until more data exist.

Data analysis

Ensure analyses fit the data (level of measurement, sample size, non-independence/clustering). Don't pick a method because the software is handy; if you delegate analysis, you retain responsibility for its suitability and accuracy.
Report effect sizes, statistical significance, and standard errors / confidence intervals for predictor–criterion relationships; describe distributions (central tendency, variance) and interrelationships.
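As one way to report an interval alongside the point estimate, the sketch below computes a confidence interval for an observed (uncorrected) Pearson validity coefficient with the Fisher z transformation. The function name `correlation_ci` is hypothetical, and the approach assumes independent observations and approximate bivariate normality; clustered or non-independent data would need a different method.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Confidence interval for an observed Pearson r via Fisher's z.

    Applies to UNCORRECTED coefficients only; corrected coefficients need
    procedures designed for them.
    """
    if not -1 < r < 1:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")

    z = math.atanh(r)                    # Fisher z transform of r
    se = 1 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Back-transform the interval limits to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    low, high = correlation_ci(0.30, 100)
    print(f"r = 0.30, N = 100, 95% CI = [{low:.2f}, {high:.2f}]")
```

With N = 100 the interval around r = .30 is wide, which illustrates why the point estimate alone says little about precision.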

Adjustments (corrections) to validity estimates

Aim for an unbiased estimate of operational validity. Observed coefficients typically underestimate the population value because of range restriction and criterion unreliability; apply suitable bivariate or multivariate corrections when an appropriate estimate of the restriction or the reliability is available.
Do not correct for predictor unreliability for operational use (you use the actual, unreliable predictor).
For ratings criteria, internal-consistency reliability may be inadequate — account for rater and time variance.
Corrected coefficients are point estimates: usual significance tests/SEs for unadjusted coefficients don't apply. Report both corrected and uncorrected values, and use procedures built for corrected coefficients when testing significance / forming CIs. State explicitly when a coefficient is theoretical and not the operational validity.
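The two corrections named above can be illustrated with their textbook bivariate forms. This is a sketch under stated assumptions, not a prescribed procedure: it assumes direct (explicit) range restriction on the predictor (the Thorndike Case II formula), a criterion reliability estimated in the restricted sample, and a correction order of unreliability first, then range restriction. Function names and the example numbers are hypothetical. Indirect or multivariate restriction calls for different formulas.

```python
import math


def correct_for_criterion_unreliability(r_xy, criterion_reliability):
    """Disattenuate for criterion unreliability only: r / sqrt(r_yy).

    The predictor's reliability is deliberately NOT corrected, because the
    operational decision uses the predictor as actually measured.
    """
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return r_xy / math.sqrt(criterion_reliability)


def correct_for_direct_range_restriction(r_xy, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct restriction on the predictor."""
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted  # ratio of applicant SD to sample SD
    return (r_xy * u) / math.sqrt(1 + r_xy ** 2 * (u ** 2 - 1))


if __name__ == "__main__":
    observed = 0.25  # hypothetical observed validity in the restricted sample
    step1 = correct_for_criterion_unreliability(observed, criterion_reliability=0.60)
    step2 = correct_for_direct_range_restriction(step1, sd_unrestricted=15.0, sd_restricted=10.0)
    # Report both values; the corrected one is a point estimate only.
    print(f"observed r = {observed:.2f}, corrected estimate = {step2:.2f}")
```

The confidence interval and significance test for the observed coefficient do not carry over to the corrected value, so both numbers should appear in the report.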

Combining predictors and criteria

Document the method of combination: multiple regression, Pareto-optimal weights, unit weights, rational weights, etc. Effective weights ≠ nominal weights (they depend on variances/covariances), especially when predictors are differentially range-restricted.
Compensatory vs. noncompensatory combination (and cutoffs/sequencing) affects rank order, expected mean criterion standing, and subgroup differences — see selection-decisions-and-scoring.
Cross-validate composites built from regression (or any data-driven weighting/keying) to guard against capitalization on chance, especially in small samples. Unit or rational weights are not estimated from the sample, so they are not subject to this shrinkage.
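To make the nominal vs. effective weight distinction concrete, the sketch below computes each component's share of composite variance from the nominal weights and the covariance matrix. Defining effective weight as the component's weighted covariance with the composite, divided by composite variance, is one common convention and is an assumption here; the function name and the example covariance matrix are hypothetical.

```python
def effective_weights(nominal_weights, covariance_matrix):
    """Share of composite variance attributable to each component.

    Component i contributes w_i * cov(x_i, composite), where
    cov(x_i, composite) = sum_j w_j * cov(x_i, x_j). The contributions
    sum to the composite variance, so the shares sum to 1.
    """
    k = len(nominal_weights)
    if len(covariance_matrix) != k or any(len(row) != k for row in covariance_matrix):
        raise ValueError("covariance_matrix must be k x k")

    # Covariance of each component with the weighted composite.
    cov_with_composite = [
        sum(nominal_weights[j] * covariance_matrix[i][j] for j in range(k))
        for i in range(k)
    ]
    contributions = [nominal_weights[i] * cov_with_composite[i] for i in range(k)]
    composite_variance = sum(contributions)
    if composite_variance <= 0:
        raise ValueError("composite variance must be positive")
    return [c / composite_variance for c in contributions]


if __name__ == "__main__":
    # Two predictors with unit nominal weights but unequal spread:
    # SDs of 10 and 5, correlated .30, so the covariance is 0.3 * 10 * 5 = 15.
    cov = [[100.0, 15.0], [15.0, 25.0]]
    shares = effective_weights([1.0, 1.0], cov)
    print([round(s, 3) for s in shares])
```

Although both predictors carry a nominal weight of 1, the higher-variance predictor dominates the composite. Standardizing components before weighting, and checking variances after range restriction, are typical ways to bring the effective weights closer to the intended ones.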

Interpreting results

Interpret against the cumulative research literature. Unusual findings (suppressors, moderators, nonlinearity, configural scoring, differential weighting of highly correlated predictors) are suspect — require a very large sample or replication before acting on them.

Pitfalls

Choosing an available criterion over a relevant one.
Ignoring power until after data collection.
Reporting only corrected validities, or applying naive significance tests to them.
Correcting for predictor unreliability and treating it as operational validity.
Capitalizing on chance via regression/empirical keying without cross-validation.
Generalizing concurrent results to a predictive (applicant) use for fakable measures.
