Skill

data-validation

Data quality validation and completeness checks. Use when verifying processed datasets, checking merge quality, or validating sample construction results.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/research-factory:data-validation Dataset to validate and expected properties

User invocable

Model invocable

Inline context

Default effort

Argument hint: Dataset to validate and expected properties

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- After any data processing or merge step

SKILL.md

70 lines · ~512 tokens

Stats

Language: Python

Parent stars: 0

Maintenance: Good

Last Commit: Jun 10, 2026

Data Validation Protocol

When to Use

- After any data processing or merge step
- Before running regressions on a new sample
- When validating extraction results

Standard Checks

1. Shape and Types

print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(df.dtypes)

2. Missing Values

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

3. Duplicates

id_cols = ["firm_id", "date"]  # adjust per dataset
dupes = df.duplicated(subset=id_cols).sum()
print(f"Duplicates on {id_cols}: {dupes}")
assert dupes == 0, f"FAIL: {dupes} duplicate rows on {id_cols}"
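
When the assertion fails, it helps to see the offending rows rather than only their count. A minimal sketch, assuming the same `id_cols` and a toy DataFrame invented for illustration:

```python
import pandas as pd

id_cols = ["firm_id", "date"]  # adjust per dataset

# Toy data for illustration only
example = pd.DataFrame({
    "firm_id": [1, 1, 2, 2],
    "date": ["2020-01", "2020-01", "2020-01", "2020-02"],
    "returns": [0.01, 0.02, 0.03, 0.04],
})

# keep=False flags every row in a duplicated group, not just the later copies,
# so conflicting values can be compared side by side.
dupe_mask = example.duplicated(subset=id_cols, keep=False)
dupe_rows = example[dupe_mask].sort_values(id_cols)
print(dupe_rows)
```

Note that `duplicated()` with its default `keep="first"` counts only the extra copies, so the count in the check above can be smaller than the number of rows shown here.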

4. Value Ranges

# Check key variables are in plausible range
for col in ["returns", "market_cap", "score"]:
    print(f"{col}: min={df[col].min():.4f}, max={df[col].max():.4f}, mean={df[col].mean():.4f}")
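
The printout above leaves the plausibility judgment to the reader. One way to make it an explicit check is sketched below; the bounds in `expected_ranges` and the helper `check_ranges` are hypothetical and should be replaced with limits that suit the dataset.

```python
import pandas as pd

# Hypothetical plausibility bounds, for illustration only.
expected_ranges = {
    "returns": (-1.0, 10.0),
    "score": (0.0, 100.0),
}

def check_ranges(df: pd.DataFrame, bounds: dict) -> dict:
    """Return, per column, the number of non-missing values outside [low, high]."""
    violations = {}
    for col, (low, high) in bounds.items():
        values = df[col].dropna()
        # Series.between is inclusive on both ends by default
        violations[col] = int((~values.between(low, high)).sum())
    return violations

# Toy data for illustration only
example = pd.DataFrame({
    "returns": [0.02, -0.45, 12.0, None],
    "score": [55.0, 101.0, 0.0, 99.9],
})
range_violations = check_ranges(example, expected_ranges)
print(range_violations)
```

Missing values are dropped before the comparison so that they are reported by the missing-value check rather than counted as range violations.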

5. Panel Balance

obs_per_entity = df.groupby("firm_id").size()
# len() counts firms; nunique() here would count distinct observation counts instead
print(f"Entities: {len(obs_per_entity)}")
print(f"Obs/entity: min={obs_per_entity.min()}, max={obs_per_entity.max()}, median={obs_per_entity.median()}")
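
The min/max/median summary shows whether entities differ in length, but not how many are complete. A minimal sketch of a balance summary, assuming `firm_id` and `date` identify the panel and that "complete" means an entity appears in every period observed in the data (the helper `panel_balance` is hypothetical):

```python
import pandas as pd

def panel_balance(df: pd.DataFrame, entity_col: str = "firm_id", time_col: str = "date") -> dict:
    """Summarize how far a panel is from being balanced."""
    n_periods = int(df[time_col].nunique())
    # Distinct periods per entity, so duplicate rows do not inflate the count
    periods_per_entity = df.groupby(entity_col)[time_col].nunique()
    complete = int((periods_per_entity == n_periods).sum())
    return {
        "entities": len(periods_per_entity),
        "periods": n_periods,
        "complete_entities": complete,
        "balanced": complete == len(periods_per_entity),
    }

# Toy data for illustration only
example = pd.DataFrame({
    "firm_id": ["A", "A", "A", "B", "B", "C"],
    "date": [2020, 2021, 2022, 2020, 2022, 2021],
})
balance = panel_balance(example)
print(balance)
```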

6. Merge Quality (after joins)

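# merge_indicator is the "_merge" column that pd.merge(..., indicator=True) adds.
# `merged` is a placeholder name for the joined DataFrame; adjust as needed.
merge_indicator = merged["_merge"]
print(f"Left only: {(merge_indicator == 'left_only').sum()}")
print(f"Right only: {(merge_indicator == 'right_only').sum()}")
print(f"Both: {(merge_indicator == 'both').sum()}")
match_rate = (merge_indicator == 'both').mean() * 100
print(f"Match rate: {match_rate:.1f}%")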
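
A minimal sketch of where `merge_indicator` can come from, assuming the join is done with pandas. The tables `left` and `right` and their columns are invented for illustration.

```python
import pandas as pd

# Toy tables for illustration only; names and columns are hypothetical.
left = pd.DataFrame({"firm_id": [1, 2, 3, 4], "returns": [0.01, 0.02, -0.03, 0.00]})
right = pd.DataFrame({"firm_id": [2, 3, 4, 5], "score": [10, 20, 30, 40]})

# indicator=True adds a "_merge" column with values left_only / right_only / both.
# validate="one_to_one" raises MergeError if the key is duplicated on either side.
merged = left.merge(
    right, on="firm_id", how="outer", indicator=True, validate="one_to_one"
)
merge_indicator = merged["_merge"]

match_rate = (merge_indicator == "both").mean() * 100
print(merge_indicator.value_counts().to_dict())
print(f"Match rate: {match_rate:.1f}%")
```

An outer join is used here so that unmatched rows from both sides stay visible. With `how="left"` the right-only rows are dropped before they can be counted, and the match rate is then relative to the left table only.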

Report Format

=== Data Validation: {dataset_name} ===
Shape: (N, K)
ID columns: [firm_id, date] — 0 duplicates
Missing: col1 (2.3%), col2 (0.1%)
Key ranges: returns [-0.45, 0.82], market_cap [1.2M, 890B]
Panel: 3,456 firms, 2010-2023
VERDICT: PASS / FAIL (reason)
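
A report in roughly this layout could be assembled from the checks above. The sketch below is a hypothetical helper, not part of the skill: it covers only shape, duplicates, and missing values, and it assumes that duplicate IDs are the only hard failure condition.

```python
import pandas as pd

def validation_report(df: pd.DataFrame, dataset_name: str, id_cols: list) -> str:
    """Build a plain-text validation report (hypothetical helper)."""
    dupes = int(df.duplicated(subset=id_cols).sum())

    # Share of missing values per column, keeping only columns with gaps
    missing_pct = (df.isnull().mean() * 100).round(2)
    missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
    missing = ", ".join(f"{col} ({pct:.1f}%)" for col, pct in missing_pct.items()) or "none"

    # Assumption: a duplicated ID is the only condition that fails the dataset.
    verdict = "PASS" if dupes == 0 else f"FAIL ({dupes} duplicate rows on {id_cols})"

    lines = [
        f"=== Data Validation: {dataset_name} ===",
        f"Shape: {df.shape}",
        f"ID columns: {id_cols} ({dupes} duplicates)",
        f"Missing: {missing}",
        f"VERDICT: {verdict}",
    ]
    return "\n".join(lines)

# Toy data for illustration only
example = pd.DataFrame({
    "firm_id": [1, 1, 2],
    "date": [2020, 2021, 2020],
    "returns": [0.1, None, 0.3],
})
report = validation_report(example, "toy_panel", ["firm_id", "date"])
print(report)
```

Range and panel lines can be appended in the same way once dataset-specific bounds and period definitions are settled.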
