From qa-ml-models
Test ML models with Giskard's scan() vulnerability detector + test catalog (performance, robustness, fairness, data leakage, ethical issues) for tabular and NLP models. Wrap a prediction function in giskard.Model + a DataFrame in giskard.Dataset; emit test suites that pass/fail in CI.
Slash command: /qa-ml-models:giskard-tests
Giskard wraps any prediction function and DataFrame, then runs a
scan() that surfaces "performance biases, unrobustness, data
leakage, stochasticity, underconfidence, ethical issues" per the
Giskard tabular quickstart.
pip install giskard --upgrade
Per the Giskard tabular quickstart.
from giskard import Dataset
giskard_dataset = Dataset(
    df=raw_data,
    target=TARGET_COLUMN,
    name="Titanic dataset",
    cat_columns=CATEGORICAL_COLUMNS,
)
cat_columns matters: Giskard treats categorical features differently
from numeric ones for slicing and drift detection.
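One way to build the CATEGORICAL_COLUMNS list is to flag columns that are non-numeric or have few distinct values. The sketch below is a heuristic and not part of Giskard; `infer_cat_columns` and its `max_unique` cut-off are hypothetical names chosen for illustration, and the sample values are made up. It works on a plain mapping of column name to values so it runs without pandas.

```python
from numbers import Number


def infer_cat_columns(columns: dict[str, list], max_unique: int = 5) -> list[str]:
    """Return the names of columns that look categorical.

    Heuristic (an assumption, not Giskard behaviour): a column counts as
    categorical if any value is non-numeric, or if it has at most
    `max_unique` distinct values.
    """
    cat_columns = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # bool is a Number subclass in Python, so treat it as non-numeric here.
        non_numeric = any(
            isinstance(v, bool) or not isinstance(v, Number) for v in present
        )
        low_cardinality = len(set(present)) <= max_unique
        if non_numeric or low_cardinality:
            cat_columns.append(name)
    return cat_columns


# Made-up sample in the spirit of the Titanic columns.
sample = {
    "sex": ["female", "male", "female", "male"],
    "pclass": [1, 3, 2, 3],
    "fare": [7.25, 71.28, 8.05, 53.1],
}
CATEGORICAL_COLUMNS = infer_cat_columns(sample, max_unique=3)
```

With a pandas DataFrame, a mapping of this shape can likely be obtained via `raw_data.to_dict("list")`. Review the result by hand before passing it as `cat_columns`; integer-coded categories and low-cardinality numeric features are easy to confuse.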
from giskard import Model
import numpy as np
import pandas as pd
def prediction_function(df: pd.DataFrame) -> np.ndarray:
    # preprocessing_function and classifier are your own fitted objects.
    preprocessed_df = preprocessing_function(df)
    # Return class probabilities, one column per label.
    return classifier.predict_proba(preprocessed_df)

giskard_model = Model(
    model=prediction_function,
    model_type="classification",
    name="Titanic model",
    classification_labels=classifier.classes_,
    feature_names=FEATURE_NAMES,
)
The prediction_function returns probabilities (not class labels)
for classification - required by Giskard's calibration checks.
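Before wrapping the function, it can help to confirm that its output really is a probability matrix. The following is a minimal sketch, assuming a 2-D result with one row per input and one column per class; `check_probabilities` is a hypothetical helper, not a Giskard function, and it takes nested lists so it runs without NumPy.

```python
import math
from typing import Sequence


def check_probabilities(probs: Sequence[Sequence[float]], n_classes: int) -> None:
    """Raise ValueError unless `probs` looks like an (n_rows, n_classes) probability matrix."""
    for i, row in enumerate(probs):
        if len(row) != n_classes:
            raise ValueError(f"row {i}: expected {n_classes} columns, got {len(row)}")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError(f"row {i}: value outside [0, 1]")
        if not math.isclose(sum(row), 1.0, abs_tol=1e-6):
            raise ValueError(f"row {i}: probabilities sum to {sum(row)}, not 1")


# predict_proba-style output: one row per input, one column per class.
check_probabilities([[0.9, 0.1], [0.25, 0.75]], n_classes=2)

# predict-style output (one class label per row) fails the check.
try:
    check_probabilities([[1], [0]], n_classes=2)
    labels_rejected = False
except ValueError:
    labels_rejected = True
```

For a NumPy array, passing `prediction_function(sample_df).tolist()` should give the nested-list form this helper expects.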
from giskard import scan
results = scan(giskard_model, giskard_dataset)
results.to_html("scan_report.html")
Per the Giskard tabular quickstart, the scan covers these categories: performance bias, unrobustness, data leakage, stochasticity, underconfidence, ethical issues. The HTML report can be uploaded as a CI artifact.
test_suite = results.generate_test_suite("My first test suite")
suite_results = test_suite.run()
if not suite_results.passed:
    raise SystemExit("Giskard test suite failed; see report")
from giskard import testing
test_suite.add_test(
    testing.test_f1(
        model=giskard_model,
        dataset=giskard_dataset,
        threshold=0.7,
    )
)
# Slicing test: F1 must hold on a subset
# The lambda filters the whole DataFrame, so disable row-level application.
female_slice = giskard_dataset.slice(lambda df: df[df.sex == "female"], row_level=False)
test_suite.add_test(
    testing.test_f1(
        model=giskard_model,
        dataset=female_slice,
        threshold=0.65,
    )
)
test_suite.run()
Catalog includes test_f1, test_accuracy, test_recall,
test_drift_*, metamorphic transformations. Reference the
Giskard tabular quickstart for the current full list.
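To make the slicing test above concrete, here is what "F1 must hold on a subset" means in arithmetic terms. This is a plain binary F1 computed by hand on made-up rows, as a sketch only; Giskard's own test_f1 implementation may differ in details such as averaging.

```python
def f1_score(y_true: list[int], y_pred: list[int], positive: int = 1) -> float:
    """Binary F1 for the `positive` class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up rows: (sex, true label, predicted label).
rows = [
    ("female", 1, 1), ("female", 1, 1), ("female", 0, 1), ("female", 1, 0),
    ("male", 0, 0), ("male", 1, 1), ("male", 0, 0), ("male", 1, 1),
]

overall_f1 = f1_score([t for _, t, _ in rows], [p for _, _, p in rows])

female_rows = [r for r in rows if r[0] == "female"]
female_f1 = f1_score([t for _, t, _ in female_rows], [p for _, _, p in female_rows])

# Same thresholds as the suite above: 0.7 overall, 0.65 on the slice.
overall_ok = overall_f1 >= 0.7
female_ok = female_f1 >= 0.65
```

On this toy data the overall F1 is 0.8 while the female slice reaches only about 0.67, which is why a slice-level threshold can catch a weakness that the global metric hides.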
- name: Giskard scan
  run: |
    python ml/giskard_scan.py
    # Script raises SystemExit on failure
- name: Upload Giskard report
  # Run even when the scan step fails, so the report is still available.
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: giskard-report
    path: scan_report.html
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Skip cat_columns parameter | Categorical features treated as numeric; bogus drift | Always pass cat_columns (Step 2) |
| Wrap a predict() (classes) instead of predict_proba() (probs) | Calibration tests cannot run | Use predict_proba for classification (Step 3) |
| Run scan once, don't add to suite | One-off finding never re-checked | Generate suite from scan (Step 5); CI gates re-run |
| Block CI on every minor scan finding | Noise; team disables Giskard | Set per-test threshold; gate on critical+major only |
| Reuse training dataset for scan | False sense of robustness; scan needs unseen data | Use held-out test split |
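
The "gate on critical+major only" fix in the table above could be sketched as a small filter in the CI script. The list-of-dicts shape, the `level` key, and `blocking_findings` are all hypothetical stand-ins for illustration; they are not the structure of Giskard's scan results, which would need to be mapped into this form first.

```python
# Severity levels that should fail the build, as named in the table above.
BLOCKING_LEVELS = {"critical", "major"}


def blocking_findings(findings: list[dict]) -> list[dict]:
    """Keep only the findings whose level should block CI."""
    return [f for f in findings if f["level"] in BLOCKING_LEVELS]


# Made-up findings, not real scan output.
findings = [
    {"name": "performance bias on a slice", "level": "major"},
    {"name": "underconfidence", "level": "minor"},
]

blockers = blocking_findings(findings)
# A non-zero exit code fails the CI step; minor findings only show in the report.
exit_code = 1 if blockers else 0
```

Keeping minor findings visible in the report but out of the gate is a deliberate trade-off: the build stays green for noise while the team can still review the full list.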
scan() is non-deterministic with stochastic models; pin
random_state in the model and Giskard config for reproducible CI.
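
The effect of pinning a seed can be illustrated with a toy stochastic predictor. This sketch uses only the standard library; `stochastic_predict` is a hypothetical stand-in for a model with randomness inside, and it does not show how Giskard itself is configured.

```python
import random
from typing import Optional


def stochastic_predict(n_rows: int, seed: Optional[int] = None) -> list[float]:
    """Toy predictor: returns one pseudo-random score per row."""
    rng = random.Random(seed)  # seed=None draws from OS entropy, so runs differ
    return [rng.random() for _ in range(n_rows)]


# With a pinned seed, two runs produce identical scores.
run_a = stochastic_predict(3, seed=42)
run_b = stochastic_predict(3, seed=42)
reproducible = run_a == run_b
```

The same idea applies to a real model: fix its random_state at training and inference time so that two CI runs on the same commit see the same predictions.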