Run Deepchecks suites (data integrity, train-test validation, model evaluation) on tabular, NLP, or vision data and models. Gate CI on regressions with the result's pass/fail status (`passed()` on a suite result, `passed_conditions()` on a single check result); the same checks can run during research, in CI, and in production monitoring, following the Deepchecks lifecycle approach.
Slash command: `/qa-ml-models:deepchecks-tests`
Deepchecks is "a holistic open-source solution for all of your AI & ML validation needs", per the Deepchecks welcome page. It covers data integrity, train-test splits, and model evaluation, spanning model development from research through production.
```bash
pip install deepchecks
```
Install command per the Deepchecks welcome page.
For tabular models:
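```python
from deepchecks.tabular import Dataset

# train_df / test_df are pandas DataFrames and CATEGORICAL_COLUMNS is a list
# of column names; all three are defined elsewhere in your project.
train_ds = Dataset(
    train_df,
    label="target",
    cat_features=CATEGORICAL_COLUMNS,
)
test_ds = Dataset(
    test_df,
    label="target",
    cat_features=CATEGORICAL_COLUMNS,
)
```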
`cat_features` matters for distribution checks, because drift checks treat categorical and numeric columns differently. The Vision and NLP APIs differ; see the quickstarts for each data type linked from the Deepchecks welcome page.
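The source leaves `CATEGORICAL_COLUMNS` undefined. One way to build a starting list is sketched below, assuming the data is available as a list of row dicts; `infer_categorical_columns` is a hypothetical helper, not part of Deepchecks, and the sample rows are made up.

```python
from numbers import Number


def infer_categorical_columns(rows, label):
    """Return sorted names of columns holding any non-numeric (or boolean) value."""
    categorical = set()
    for row in rows:
        for column, value in row.items():
            if column == label or value is None:
                continue
            # bool is a Number subclass in Python, so test for it explicitly.
            if isinstance(value, bool) or not isinstance(value, Number):
                categorical.add(column)
    return sorted(categorical)


# Made-up sample rows for illustration only.
rows = [
    {"age": 34, "city": "Oslo", "is_trial": True, "target": 1},
    {"age": 51, "city": "Lima", "is_trial": False, "target": 0},
]
CATEGORICAL_COLUMNS = infer_categorical_columns(rows, label="target")
print(CATEGORICAL_COLUMNS)
```

This heuristic misses integer-coded categories (for example a numeric region code), so review the list by hand. With a pandas DataFrame, `train_df.select_dtypes(include=["object", "category", "bool"]).columns` gives a similar starting point.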
```python
from deepchecks.tabular.suites import data_integrity

integrity = data_integrity()
result = integrity.run(train_ds)
result.save_as_html("data_integrity.html")
```
Catches: duplicate rows, missing values, mixed types, conflicting labels, single-value features, string mismatches.
```python
from deepchecks.tabular.suites import train_test_validation

validation = train_test_validation()
result = validation.run(train_ds, test_ds)
result.save_as_html("train_test_validation.html")
```
Catches: target drift, feature drift, train-test data leakage, label imbalance, dataset size mismatch.
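To make one of these failure modes concrete, the sketch below measures the simplest form of train-test leakage: test rows that also appear verbatim in the training set. It is an illustration of the idea only, not how Deepchecks implements its checks; `overlap_ratio` and the sample rows are hypothetical.

```python
def overlap_ratio(train_rows, test_rows):
    """Return the fraction of test rows that also appear verbatim in train."""
    if not test_rows:
        return 0.0
    # Rows are dicts of hashable values; sort items so key order is irrelevant.
    seen = {tuple(sorted(row.items())) for row in train_rows}
    repeated = sum(tuple(sorted(row.items())) in seen for row in test_rows)
    return repeated / len(test_rows)


# Made-up rows: the first test row duplicates a training row.
train_rows = [{"age": 34, "city": "Oslo"}, {"age": 51, "city": "Lima"}]
test_rows = [{"city": "Oslo", "age": 34}, {"age": 29, "city": "Kyiv"}]
print(overlap_ratio(train_rows, test_rows))
```

A non-zero ratio means test metrics are partly measured on memorized rows, which is why the suite flags it before model evaluation.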
```python
from deepchecks.tabular.suites import model_evaluation

evaluation = model_evaluation()
result = evaluation.run(train_ds, test_ds, model)
result.save_as_html("model_evaluation.html")

# Suite results expose passed(); passed_conditions() belongs to
# single-check results (see the FeatureDrift example).
if not result.passed():
    raise SystemExit("Deepchecks model evaluation failed")
```
Catches: performance regression vs baseline, weak segments, calibration issues, prediction drift between train and test.
```python
from deepchecks.tabular.checks import FeatureDrift

check = FeatureDrift().add_condition_drift_score_less_than(
    max_allowed_categorical_score=0.2,
    max_allowed_numeric_score=0.1,
)
result = check.run(train_ds, test_ds)

if not result.passed_conditions():
    print(result.value)
    raise SystemExit("FeatureDrift failed threshold")
```
Each check has `add_condition_*` methods; chain them for per-check gating.
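When the condition fails, it helps to see which features crossed which limit. The sketch below applies the same two thresholds by hand, assuming `result.value` maps each feature name to a dict with a `"Drift score"` entry (verify this shape against your Deepchecks version). `features_over_threshold` is a hypothetical helper and the scores are made up.

```python
def features_over_threshold(drift, cat_features, max_cat=0.2, max_num=0.1):
    """Return {feature: (score, limit)} for features at or above their limit."""
    failing = {}
    for feature, info in drift.items():
        score = info["Drift score"]
        # Categorical and numeric features get separate limits, as in the condition.
        limit = max_cat if feature in cat_features else max_num
        if score >= limit:  # the condition requires score < limit
            failing[feature] = (score, limit)
    return failing


# Made-up scores standing in for result.value.
drift = {
    "age": {"Drift score": 0.15},
    "city": {"Drift score": 0.15},
    "income": {"Drift score": 0.05},
}
print(features_over_threshold(drift, cat_features={"city"}))
```

The same score of 0.15 fails for the numeric `age` column but passes for the categorical `city` column, which is why the two limits are set separately.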
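```yaml
- name: Deepchecks suite
  run: |
    python ml/deepchecks_suite.py
- name: Upload Deepchecks reports
  # Upload even when the suite step fails, so the reports are available for debugging.
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: deepchecks-reports
    path: "*.html"
```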
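The workflow calls `ml/deepchecks_suite.py`, which the source does not show. A minimal sketch of its gating core follows, assuming each suite result exposes `save_as_html(path)` and `passed()`; the `gate` function and its `out_dir` parameter are hypothetical names, not part of Deepchecks.

```python
from pathlib import Path


def gate(results, out_dir="."):
    """Save one HTML report per suite and return the names of failing suites.

    `results` maps a report name to a suite result object that provides
    `save_as_html(path)` and `passed()` (assumed interface).
    """
    failed = []
    for name, result in results.items():
        # Always write the report first, so failures can be inspected in CI.
        result.save_as_html(str(Path(out_dir) / f"{name}.html"))
        if not result.passed():
            failed.append(name)
    return failed
```

In the script, `results` would be built from the suite runs, for example `{"data_integrity": data_integrity().run(train_ds)}`, and the process would exit non-zero when `gate(results)` returns a non-empty list. Writing reports to the working directory matches the `"*.html"` upload pattern in the workflow.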
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Skip `cat_features` | All categorical checks misfire | Always specify (Step 2) |
| Skip `data_integrity` suite | Train on leaky / dup-heavy data | Run before `train_test_validation` (Step 3) |
| Block CI on every check | Hundreds of warning conditions; team disables | Define per-check thresholds (Step 6); gate Critical only |
| Re-run on the SAME test split each PR | Fixed split → fixed results; no drift detection | Use rolling/cross-validation splits |
| Reuse training data as "current" for production monitoring | Always passes drift; blind to real drift | Use real production samples |
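The "gate Critical only" fix can be sketched as a split between blocking and advisory failures. This assumes you have collected a pass/fail flag per check name; which checks count as critical is a team policy, so the set below is hypothetical, as are `split_failures` and the sample outcomes.

```python
def split_failures(outcomes, critical):
    """Split failed check names into (blocking, advisory) lists."""
    failed = sorted(name for name, passed in outcomes.items() if not passed)
    blocking = [name for name in failed if name in critical]
    advisory = [name for name in failed if name not in critical]
    return blocking, advisory


# Made-up outcomes: True means the check's conditions passed.
outcomes = {"FeatureDrift": False, "StringMismatch": False, "MixedNulls": True}
blocking, advisory = split_failures(outcomes, critical={"FeatureDrift"})
print("blocking:", blocking)
print("advisory:", advisory)
```

CI would fail only when `blocking` is non-empty and log `advisory` as warnings, which keeps the gate strict on the checks that matter without pushing the team to disable it.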
`add_condition_*` methods (Step 6).