From dlaw
How to run ssmd DQ checks locally and in-cluster, interpret scores, trigger email reports, and verify results. Use when running data quality checks, re-sending DQ emails, or verifying pipeline health after deployments or backfills.
Slash command: /dlaw:ssmd-dq-run
Procedures for running ssmd Data Quality checks and interpreting results.
| File | Purpose |
|---|---|
| data/dq.py | DQRunner engine: 13 checks, scoring, CLI |
| data/dq_email.py | Email report wrapper: runs all feeds, HTML output |
| data/Dockerfile | DQ image: python:3.12-slim + duckdb + gcloud monitoring |
Requires gcloud auth application-default login for GCS access.
# Single feed
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
# With verbose progress
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --verbose
# JSON output (for programmatic use)
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --json
# Non-default prefix (when GCS prefix differs from feed name)
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
Run all three feeds for full pipeline verification. The commands are independent, so they can run in parallel (for example, one per terminal):
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
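One way to launch the three commands above concurrently is a small Python wrapper. This is a minimal sketch and not part of the ssmd repo: `FEEDS`, `build_command` and `run_all` are hypothetical names, the Kalshi entry omits `--prefix` to use its default, and it assumes `uv` and `data/dq.py` are reachable from the current directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# (feed, stream, prefix) per feed, copied from the commands above.
# prefix=None means "do not pass --prefix, use the default".
FEEDS = [
    ("kalshi", "crypto", None),
    ("kraken-futures", "futures", "kraken-futures"),
    ("polymarket", "markets", "polymarket"),
]


def build_command(date, feed, stream, prefix=None):
    """Build the dq.py argument list for one feed."""
    cmd = ["uv", "run", "data/dq.py", "--date", date,
           "--feed", feed, "--stream", stream]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    return cmd


def run_all(date, runner=subprocess.run):
    """Run every feed concurrently and return {feed: exit code}."""
    commands = {feed: build_command(date, feed, stream, prefix)
                for feed, stream, prefix in FEEDS}
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {feed: pool.submit(runner, cmd)
                   for feed, cmd in commands.items()}
        return {feed: fut.result().returncode
                for feed, fut in futures.items()}
```

Calling `run_all("2026-02-17")` returns one exit code per feed. Output from the three processes will interleave on the terminal, so this suits a quick pass/fail sweep better than reading detailed reports.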
| Feed | --feed | --stream | --prefix |
|---|---|---|---|
| Kalshi | kalshi | crypto | (default: kalshi) |
| Kraken Futures | kraken-futures | futures | kraken-futures |
| Polymarket | polymarket | markets | polymarket |
The DQ CronJob runs at 03:30 UTC daily (after parquet-gen at 02:00 UTC).
Manifest: clusters/gke-prod/apps/ssmd/cronjobs/dq-daily.yaml
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-manual-MMDD -n ssmd
kubectl logs -n ssmd job/ssmd-dq-manual-MMDD -f
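The CronJob checks yesterday's data by default. To check a different date, render the Job manifest and patch the dq_email.py command before applying it: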
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-rerun-MMDD -n ssmd --dry-run=client -o yaml | \
sed 's|dq_email.py|dq_email.py --date 2026-02-17|' | \
kubectl apply -f -
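The MMDD suffix and the `--date` value can be derived instead of typed by hand. The sketch below assumes that "yesterday" is computed in UTC and that the MMDD suffix refers to the date being checked. Neither is confirmed by the manifest, and both helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def default_dq_date(now=None):
    """Return yesterday's date (UTC) as YYYY-MM-DD."""
    now = now or datetime.now(timezone.utc)
    yesterday = now.astimezone(timezone.utc).date() - timedelta(days=1)
    return yesterday.isoformat()


def job_name(dq_date, kind="manual"):
    """Build a Job name such as ssmd-dq-manual-0217 from YYYY-MM-DD."""
    d = date.fromisoformat(dq_date)
    return f"ssmd-dq-{kind}-{d:%m%d}"
```

For a rerun of 2026-02-17, `job_name("2026-02-17", "rerun")` produces the name to substitute for `ssmd-dq-rerun-MMDD`.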
| Grade | Score Range | Meaning |
|---|---|---|
| GREEN | >= 98 | Pipeline healthy, all checks passing |
| YELLOW | >= 85 and < 98 | Minor issues, investigate when convenient |
| RED | < 85 | Significant issues, investigate promptly |
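The grade thresholds in the table can be written as a short function. This is an illustrative sketch of the mapping, not the code in dq.py, and `grade_for` is a hypothetical name.

```python
def grade_for(score):
    """Map a 0-100 DQ score to GREEN, YELLOW or RED."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 98:
        return "GREEN"
    if score >= 85:
        return "YELLOW"
    return "RED"
```

Both boundaries are inclusive on the upper grade: exactly 98 is GREEN and exactly 85 is YELLOW.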
| Status | Weight | Meaning |
|---|---|---|
| pass | 1.0 | Check passed |
| warn | 0.7 | Threshold exceeded but not critical |
| fail | 0.0 | Check failed |
| skip | excluded | Not enough data to run, excluded from score |
Score = average weight of the non-skipped checks * 100.
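The scoring rule can be sketched in a few lines. This illustrates the formula and is not the implementation in dq.py. `score_from_statuses` is a hypothetical name, and returning `None` when every check is skipped is an assumption.

```python
# Weights from the status table; "skip" is excluded from the average.
WEIGHTS = {"pass": 1.0, "warn": 0.7, "fail": 0.0}


def score_from_statuses(statuses):
    """Average the weights of non-skipped checks and scale to 0-100."""
    counted = [WEIGHTS[s] for s in statuses if s != "skip"]
    if not counted:
        return None  # assumption: no score when every check is skipped
    return sum(counted) / len(counted) * 100
```

For example, 11 pass, 1 warn and 1 skip gives (11 + 0.7) / 12 * 100, about 97.5, which grades YELLOW.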
dq.py exits 1 if any check has status fail.
dq_email.py always exits 0 (email is the alert mechanism).
from dq import DQRunner
runner = DQRunner(bucket="ssmd-data", feed="kalshi", stream="crypto")
results = runner.run("2026-02-12")
results.summary() # print human-readable report
results.score() # float 0-100
results.to_json() # JSON string
# Ad-hoc queries via the shared DuckDB connection
runner.con.execute(
"SELECT * FROM read_parquet('gcs://ssmd-data/kalshi/crypto/2026-02-12/ticker_*.parquet') LIMIT 10"
).fetchdf()
# Date range
all_results = runner.run_range("2026-02-10", "2026-02-17")
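After a multi-day run it helps to list the days that need attention. The return type of `run_range` is not documented here, so this sketch works on a plain dict of date to score instead, which could be filled by calling `runner.run(date).score()` for each day. `days_below` is a hypothetical helper.

```python
def days_below(scores_by_date, threshold=98.0):
    """Return (date, score) pairs under the threshold, worst first."""
    flagged = [(d, s) for d, s in scores_by_date.items() if s < threshold]
    return sorted(flagged, key=lambda pair: pair[1])
```

With the default threshold of 98, every day that did not grade GREEN is returned.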
dq_email.py runs all 3 feeds, generates an HTML email with per-feed grades and check details, and sends via SMTP.
Required env vars: SMTP_USER, SMTP_PASS, SMTP_TO
Optional: SMTP_HOST (default: smtp.gmail.com), SMTP_PORT (default: 587)
These are provided in-cluster via the ssmd-smtp-credentials Secret.
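When running dq_email.py outside the cluster, checking the environment up front avoids a failure at send time. This is a minimal sketch using the variable names and defaults listed above. `load_smtp_config` is a hypothetical helper and does not reflect how dq_email.py actually reads its settings.

```python
import os

REQUIRED = ("SMTP_USER", "SMTP_PASS", "SMTP_TO")
DEFAULTS = {"SMTP_HOST": "smtp.gmail.com", "SMTP_PORT": "587"}


def load_smtp_config(env=None):
    """Return SMTP settings, raising if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    config["SMTP_PORT"] = int(config["SMTP_PORT"])  # port as an integer
    return config
```

Passing a dict as `env` makes the helper easy to try without touching the real environment.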
After deploying a new DQ version or backfilling parquet data:
kubectl create job --from=cronjob/ssmd-dq-daily ...
The DQ image is built from data/Dockerfile. Builds are triggered by dq-v* tags in the 899bushwick repo (not ssmd).
See the ssmd-deploy skill for full deployment procedure.