From dlaw
Catalog of ssmd DQ checks — what each measures, pass/fail criteria, common failure modes and fixes, fanout rules, and accountability SLOs. Use when investigating DQ failures, understanding check behavior, or adding new checks.
How this skill is triggered — by the user, by Claude, or both
Slash command
/dlaw:ssmd-dq-checksThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Reference catalog for all 13 DQ checks in `data/dq.py`.
Reference catalog for all 13 DQ checks in data/dq.py.
What: Verifies parquet files exist and the archiver manifest is present.
| Status | Criteria |
|---|---|
| pass | Parquet files found AND manifest present with files listed |
| warn | Parquet files found but no manifest (or manifest has no files) |
| fail | No parquet files found |
Common failures: Archiver not running, GCS permissions, wrong date.
What: Reports row counts per message type from parquet files. Uses archiver manifest records_by_type or falls back to parquet-gen manifest records_written.
| Status | Criteria |
|---|---|
| pass | At least one parquet file has rows |
| fail | All parquet files are empty (0 rows) |
Note: Polymarket archiver manifest has files: [] — the check falls back to pq_manifest for type breakdown.
What: Validates parquet column names match expected schema per feed/type.
| Status | Criteria |
|---|---|
| pass | All files match expected schema |
| warn | Extra columns present (forward-compatible) |
| fail | Missing required columns |
What: Checks null percentage for required (non-nullable) columns.
| Status | Criteria |
|---|---|
| pass | All required columns have < 1% nulls |
| warn | 1-5% nulls in required columns |
| fail | > 5% nulls in required columns |
What: Checks for duplicate rows by primary key (feed-specific).
| Status | Criteria |
|---|---|
| pass | No duplicates found |
| warn | < 1% duplicate rate |
| fail | >= 1% duplicate rate |
Primary keys: Kalshi trade: trade_id. Kalshi ticker: (market_ticker, ts, _nats_seq).
What: Checks for gaps in _nats_seq (NATS JetStream sequence numbers).
| Status | Criteria |
|---|---|
| pass | No gaps in _nats_seq |
| warn | Gaps present but < 1% of total |
| fail | Gaps >= 1% of total |
Common failures: NATS consumer restart, archiver reconnection.
What: Checks for time gaps in exchange timestamps. Divides the day into 15-minute slots and checks for empty slots.
| Status | Criteria |
|---|---|
| pass | < 5% of 15-min slots empty |
| warn | 5-15% empty |
| fail | > 15% empty |
What: Per-ticker row counts and time coverage. Reports top/bottom tickers by volume.
| Status | Criteria |
|---|---|
| pass | Always passes (informational) |
What: Verifies every JSONL line is accounted for in parquet. The zero-tolerance accountability check.
Sources (in priority order):
parquet-manifest.json) — preferred, scoped to what was actually processedmanifest.json) — fallback for pre-v2.0.0 dataComparison logic:
parquet_rows == jsonl_lines (exact match), or parquet_rows >= jsonl_lines for fanout types| Status | Criteria |
|---|---|
| pass | All data types accounted for |
| warn | Minor discrepancies or coverage < 100% |
| fail | Missing data or unknown types |
Pipeline stats surfaced: lines_json_error, lines_type_unknown, lines_no_schema, parse_batch_dropped
u64 underflow guard: Values > 2^63 in parse_batch_dropped are treated as 0 (known Rust underflow from fanout).
What: Gaps in exchange-provided sequence numbers per file.
Grouping:
_shard_id (v1.3.0+), falls back to sid (v1.2.0), then ticker_colproduct_id| Status | Criteria |
|---|---|
| pass | No gaps in exchange sequences |
| warn | Gaps present but coverage > 95% |
| fail | Coverage < 95% |
Common failures: Kalshi pre-1.3.0 data groups by sid which collides across shards — expect false gaps on old data. Post-1.3.0 groups by _shard_id for accurate per-shard analysis.
What: Percentage of 15-minute time slots (96/day) that have exchange-timestamped data.
| Status | Criteria |
|---|---|
| pass | >= 95% of slots have data |
| warn | 80-95% |
| fail | < 80% |
Feed-specific timestamp columns: Kalshi ts, Kraken time, Polymarket timestamp_ms.
Note: Polymarket coverage may be lower due to activity concentrated in US hours.
What: WebSocket connection uptime from Cloud Monitoring websocket_connected gauge.
| Status | Criteria |
|---|---|
| pass | >= 99% uptime (min across all shards) |
| warn | 95-99% |
| fail | < 95% |
Known issue: Ghost shards from before connector restarts have low data point counts, dragging down the min. The check should consider only the latest shard generation.
Requires: google-cloud-monitoring package, Monitoring Viewer IAM role on the SA.
What: Verifies parquet files report their schema version via metadata.
| Status | Criteria |
|---|---|
| pass | Schema version present and recognized |
| warn | Schema version missing (pre-versioning data) |
| fail | Unknown schema version |
Some message types produce multiple parquet rows per JSONL line (1:N fan-out):
| Feed | Type | Reason |
|---|---|---|
| kraken | ticker | Batch array: one WS message contains multiple symbol updates |
| kraken | trade | Batch array: one WS message contains multiple trades |
| polymarket | price_change | Array-wrapped messages may contain multiple updates |
For fanout types: parquet_rows >= jsonl_lines (not exact match).
For non-fanout types: parquet_rows == jsonl_lines (exact match).
| Feed | Data Types (have parquet schemas) | Control Types (skipped) |
|---|---|---|
| kalshi | ticker, trade, market_lifecycle_v2 | subscribed, ok, error |
| kraken | ticker, trade | heartbeat, status |
| kraken-futures | ticker, trade | control |
| polymarket | book, last_trade_price, price_change, best_bid_ask | new_market, market_resolved |
{
"type": "trade",
"sid": 2,
"seq": 5724,
"msg": {
"trade_id": "uuid",
"market_ticker": "KXBTCD-26FEB0317-T76999.99",
"yes_price": 17,
"count": 130,
"taker_side": "no",
"ts": 1770153448
},
"_shard_id": 3
}
{
"trade_id": "uuid",
"ticker": "KXBTCD-26FEB0317-T76999.99",
"yes_price": 17,
"count": 130,
"taker_side": "no",
"created_time": "2026-02-03T21:17:28.18002Z"
}
def check_<name>(con, base, ...) -> dict to data/dq.py{"check": "<name>", "status": "pass|warn|fail|skip", "detail": "..."}check_fns list in DQRunner.run()dq-v*, update dq-daily.yaml manifest (see ssmd-deploy skill)gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/
manifest.json # archiver manifest (records_by_type, files list)
parquet-manifest.json # parquet-gen manifest (parse_batch_input, records_written)
ticker_000000.parquet # parquet files per type
trade_000000.parquet
...
npx claudepluginhub aaronwald/dlawskillz --plugin dlawImplements data quality validation with Great Expectations, dbt tests, and data contracts. Use for building data quality pipelines, validation rules, or establishing data contracts.
Validates data quality using Great Expectations, dbt tests, and data contracts for formal rules, expectation suites, checkpoints, and CI/CD pipelines.
Manage data quality in DataHub: create and run assertions, check outcomes, raise/resolve incidents, and diagnose health problems across your data estate.