From qa-ml-models
Use Evidently OSS (100+ evaluation metrics, declarative testing API) to detect data drift, target drift, and model performance regression - wired into CI as a gate and into production monitoring as a continuous check. Reports as HTML + JSON for both human review and pipeline assertions.
Slash command: /qa-ml-models:evidently-monitoring
Evidently is "an open-source Python library with over 40+ million downloads. It provides 100+ evaluation metrics, a declarative testing API, and a lightweight visual interface" per Evidently docs.
```shell
pip install evidently
```
See the canonical install snippet at https://docs.evidentlyai.com/snippets/install_evidently_oss for the current pinned version constraints.
The standard pattern compares two datasets:
```python
import pandas as pd
from evidently import Report
from evidently.presets import DataDriftPreset

# Reference: a stable baseline sample. Current: the data under evaluation.
reference_df = pd.read_parquet("reference.parquet")
current_df = pd.read_parquet("current.parquet")

report = Report(metrics=[DataDriftPreset()])
my_eval = report.run(reference_data=reference_df, current_data=current_df)
my_eval.save_html("drift_report.html")
```
Result: HTML dashboard + structured JSON. Per Evidently docs, the preset bundles per-feature drift detection with sane defaults.
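```python
from evidently import Report
from evidently.metrics import ValueDrift
from evidently.presets import DataDriftPreset

# include_tests=True attaches pass/fail conditions to each metric.
report = Report(
    metrics=[
        DataDriftPreset(),
        # Explicit target-drift check: PSI with a 0.2 threshold.
        ValueDrift(column="target", method="psi", threshold=0.2),
    ],
    include_tests=True,
)
result = report.run(reference_data=reference_df, current_data=current_df)
result.save_html("drift_report.html")

# Assumption: the snapshot dict exposes a "tests" list whose entries carry a
# "status" field; verify the key names against your installed Evidently version.
failed = [t for t in result.dict().get("tests", []) if t.get("status") == "FAIL"]
if failed:
    raise SystemExit(f"{len(failed)} Evidently test(s) failed; see drift_report.html")
```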
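Statistical test options include psi, wasserstein, ks, chisquare, and jensenshannon. Pick by data type: ks and wasserstein apply to numerical features, chisquare to categorical features, and psi and jensenshannon to both. PSI is conventional for tabular production drift.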
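To make the PSI threshold of 0.2 more concrete, here is a minimal sketch of the textbook Population Stability Index in plain Python. It is an illustration only: the function name `psi`, the equal-width binning, and the `eps` floor are assumptions of this sketch, and Evidently's own binning and smoothing may differ.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-4) -> float:
    """Textbook PSI between two numeric samples (illustrative sketch).

    Bin edges are equal-width over the reference range; eps floors the
    share of empty bins so the logarithm stays defined.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference_sample = [i / 100 for i in range(100)]
shifted_sample = [x + 0.5 for x in reference_sample]

no_drift = psi(reference_sample, reference_sample)
strong_drift = psi(reference_sample, shifted_sample)
```

Identical samples score zero, while the shifted sample scores far above 0.2, which is the kind of gap the threshold is meant to separate.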
```python
from evidently import Report
from evidently.presets import ClassificationPreset, RegressionPreset

# ref / cur: reference and current DataFrames, each holding the target
# and prediction columns.

# Regression
report = Report(metrics=[RegressionPreset()])
report.run(reference_data=ref, current_data=cur).save_html("regression.html")

# Classification
report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=ref, current_data=cur).save_html("classification.html")
```
Requires both prediction and target columns in both DataFrames.
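A cheap pre-flight check can catch a missing column before the report runs. The sketch below is one possible way to do it; the helper names `missing_columns` and `check_frames` and the column names `target` and `prediction` are assumptions for illustration, not part of Evidently.

```python
from typing import Dict, Iterable, List

# Assumed column names; adjust to whatever your dataset actually uses.
REQUIRED_COLUMNS = ("target", "prediction")


def missing_columns(columns: Iterable[str],
                    required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_frames(reference_columns: Iterable[str],
                 current_columns: Iterable[str]) -> None:
    """Raise ValueError if either dataset lacks a required column."""
    problems: Dict[str, List[str]] = {}
    for label, cols in (("reference", reference_columns),
                        ("current", current_columns)):
        missing = missing_columns(cols)
        if missing:
            problems[label] = missing
    if problems:
        raise ValueError(f"missing required columns: {problems}")
```

With pandas, this could be called as `check_frames(ref.columns, cur.columns)` just before building the report.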
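```python
# Daily monitoring job
import datetime
from pathlib import Path

from evidently import Report
from evidently.presets import DataDriftPreset

# load_production_window, load_reference_window and notify_oncall are
# project-specific helpers, not part of Evidently.
today = datetime.date.today().isoformat()
current_df = load_production_window(start=today, days=1)
reference_df = load_reference_window()

report = Report(metrics=[DataDriftPreset()], include_tests=True)
result = report.run(reference_data=reference_df, current_data=current_df)

Path("monitoring").mkdir(exist_ok=True)
result.save_html(f"monitoring/{today}.html")

# Assumption: the snapshot dict exposes a "tests" list with a "status" per entry.
if any(t.get("status") == "FAIL" for t in result.dict().get("tests", [])):
    notify_oncall(f"Data drift detected on {today}")
```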
Pair with a scheduler (Airflow / Prefect / cron / Argo Workflows).
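Which reference window a scheduled job compares against matters as much as the schedule. The sketch below uses made-up numbers (a feature mean starting at 100 and rising by 1 per day, with a hypothetical alert threshold of 5) to show how a yesterday-versus-today comparison behaves next to a comparison against a pinned baseline.

```python
from typing import List

ALERT_THRESHOLD = 5  # hypothetical: alert when the mean moves by more than 5 units


def daily_means(start: int, step: int, days: int) -> List[int]:
    """Feature mean on day 0..days, drifting by `step` each day."""
    return [start + step * day for day in range(days + 1)]


def rolling_shifts(means: List[int]) -> List[int]:
    """Shift of each day against the previous day (rolling reference)."""
    return [abs(today - yesterday) for yesterday, today in zip(means, means[1:])]


def pinned_shifts(means: List[int]) -> List[int]:
    """Shift of each day against day 0 (pinned reference)."""
    return [abs(today - means[0]) for today in means[1:]]


means = daily_means(start=100, step=1, days=100)
rolling_alerts = sum(shift > ALERT_THRESHOLD for shift in rolling_shifts(means))
pinned_alerts = sum(shift > ALERT_THRESHOLD for shift in pinned_shifts(means))
first_pinned_alert_day = next(
    day for day, shift in enumerate(pinned_shifts(means), start=1)
    if shift > ALERT_THRESHOLD
)
```

In this toy setup the rolling comparison never alerts, while the pinned comparison starts alerting on day 6 and keeps alerting for the remaining 95 days.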
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Use yesterday as reference (rolling window only) | Slow drifts go undetected: a feature that shifts 1% per day never trips a day-over-day check, yet after 100 days it has moved about 100% from the original baseline | Pin a stable reference (Step 2) |
| Run only on training data | Training data is curated; never reflects real production distribution | Use real production samples (Step 6) |
| Default thresholds for all metrics | Defaults are textbook; production tolerance differs | Tune per-feature thresholds (Step 4) |
| Block deploy on every drift | High-traffic production shifts daily; team disables monitor | Severity tiers: critical drift blocks; minor drift alerts |
| Skip target drift | Concept drift (inputs stable, output behavior changed) goes undetected | Always include a drift check on the target column (Step 4) |
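The severity-tier fix in the table can be sketched as a small triage function. Everything here is hypothetical: the `Action` enum, the `triage` name, the 0.2 threshold, and the assumption that scores are PSI-like (higher means more drift). Per-feature scores would come from whatever drift report you already produce.

```python
from enum import Enum
from typing import Mapping, Set


class Action(Enum):
    PASS = "pass"    # no feature drifted
    ALERT = "alert"  # only non-critical features drifted: notify, do not block
    BLOCK = "block"  # at least one critical feature drifted: fail the gate


def triage(drift_scores: Mapping[str, float],
           critical_features: Set[str],
           threshold: float = 0.2) -> Action:
    """Map per-feature drift scores to a pipeline action."""
    drifted = {name for name, score in drift_scores.items() if score >= threshold}
    if drifted & critical_features:
        return Action.BLOCK
    if drifted:
        return Action.ALERT
    return Action.PASS


critical = {"target"}
blocked = triage({"age": 0.05, "target": 0.30}, critical)
alerted = triage({"age": 0.30, "target": 0.05}, critical)
passed = triage({"age": 0.01, "target": 0.02}, critical)
```

Keeping the critical set small is the design choice that matters: if every feature can block a deploy, the gate tends to get switched off.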
See also: the fairlearn-fairness skill.
Install: npx claudepluginhub testland/qa --plugin qa-ml-models