Skill

ssmd-dq-checks

Catalog of ssmd DQ checks — what each measures, pass/fail criteria, common failure modes and fixes, fanout rules, and accountability SLOs. Use when investigating DQ failures, understanding check behavior, or adding new checks.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/dlaw:ssmd-dq-checks

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Reference catalog for all 13 DQ checks in `data/dq.py`.

SKILL.md

248 lines · ~2k tokens

Stats

Stars: 1

Maintenance: Excellent

Last Commit: Apr 4, 2026

ssmd-dq-checks

Reference catalog for all 13 DQ checks in data/dq.py.

Check Catalog

Check 1: data_completeness

What: Verifies parquet files exist and the archiver manifest is present.

Status	Criteria
pass	Parquet files found AND manifest present with files listed
warn	Parquet files found but no manifest (or manifest has no files)
fail	No parquet files found

Common failures: Archiver not running, GCS permissions, wrong date.

Check 2: record_counts

What: Reports row counts per message type from parquet files. Uses archiver manifest records_by_type or falls back to parquet-gen manifest records_written.

Status	Criteria
pass	At least one parquet file has rows
fail	All parquet files are empty (0 rows)

Note: Polymarket archiver manifest has files: [] — the check falls back to pq_manifest for type breakdown.

Check 3: schema

What: Validates parquet column names match expected schema per feed/type.

Status	Criteria
pass	All files match expected schema
warn	Extra columns present (forward-compatible)
fail	Missing required columns
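
The pass/warn/fail rule above amounts to a set comparison between expected and actual column names. A minimal sketch, assuming column names are plain strings; `classify_schema` is a hypothetical helper for illustration, not a function from data/dq.py.

```python
def classify_schema(expected: set[str], actual: set[str]) -> dict:
    """Compare a file's columns to the expected schema for its feed/type."""
    missing = sorted(expected - actual)  # required columns the file lacks
    extra = sorted(actual - expected)    # columns the schema does not know about
    if missing:
        status = "fail"
    elif extra:
        status = "warn"  # extra columns are tolerated as forward-compatible
    else:
        status = "pass"
    return {"status": status, "missing": missing, "extra": extra}
```

Missing columns take precedence over extra ones, so a file that both lacks a required column and carries an unexpected one is reported as fail.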

Check 4: null_rates

What: Checks null percentage for required (non-nullable) columns.

Status	Criteria
pass	All required columns have < 1% nulls
warn	1-5% nulls in required columns
fail	> 5% nulls in required columns
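
The thresholds above can be read as a small classifier. A minimal sketch, assuming the null rate is expressed as a percentage and that exactly 1% and exactly 5% fall in the warn band, as the "1-5%" row suggests. Both helpers are hypothetical names, and taking the worst status across columns is an assumption about how per-column results are combined.

```python
def classify_null_rate(null_pct: float) -> str:
    """Map the null percentage of one required column to a status."""
    if null_pct < 1.0:
        return "pass"
    if null_pct <= 5.0:
        return "warn"
    return "fail"


def worst_status(statuses: list[str]) -> str:
    """Combine per-column statuses; the worst one wins (assumed behavior)."""
    order = {"pass": 0, "warn": 1, "fail": 2}
    return max(statuses, key=lambda s: order[s], default="pass")
```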

Check 5: duplicates

What: Checks for duplicate rows by primary key (feed-specific).

Status	Criteria
pass	No duplicates found
warn	< 1% duplicate rate
fail	>= 1% duplicate rate

Primary keys: Kalshi trade: trade_id. Kalshi ticker: (market_ticker, ts, _nats_seq).
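
To make the duplicate rate concrete, here is a minimal sketch over in-memory rows. It assumes the rate is defined as surplus rows (rows beyond the first occurrence of each key) divided by total rows; the actual definition in data/dq.py may differ. `duplicate_rate_pct` and `classify_duplicates` are hypothetical names.

```python
from collections import Counter


def duplicate_rate_pct(rows: list[dict], key_cols: tuple[str, ...]) -> float:
    """Percentage of rows that repeat an already-seen primary key."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    surplus = sum(n - 1 for n in counts.values())
    return 100.0 * surplus / len(rows)


def classify_duplicates(rate_pct: float) -> str:
    """Apply the thresholds from the table above."""
    if rate_pct == 0:
        return "pass"
    if rate_pct < 1.0:
        return "warn"
    return "fail"
```

For Kalshi trades the key would be `("trade_id",)`; for Kalshi tickers, `("market_ticker", "ts", "_nats_seq")`.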

Check 6: nats_continuity

What: Checks for gaps in _nats_seq (NATS JetStream sequence numbers).

Status	Criteria
pass	No gaps in _nats_seq
warn	Gaps present but < 1% of total
fail	Gaps >= 1% of total

Common failures: NATS consumer restart, archiver reconnection.
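
A gap in `_nats_seq` can be measured by comparing the number of distinct sequence numbers seen against the span between the lowest and highest. The sketch below assumes "% of total" means missing sequence numbers over the expected span, which is an interpretation rather than the confirmed definition; `nats_gap_stats` is a hypothetical helper.

```python
def nats_gap_stats(seqs: list[int]) -> dict:
    """Count missing sequence numbers between the min and max observed."""
    unique = set(seqs)
    if len(unique) < 2:
        return {"missing": 0, "gap_pct": 0.0}
    expected = max(unique) - min(unique) + 1  # contiguous span if nothing was lost
    missing = expected - len(unique)
    return {"missing": missing, "gap_pct": 100.0 * missing / expected}
```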

Check 7: ts_continuity

What: Checks for time gaps in exchange timestamps. Divides the day into 15-minute slots and checks for empty slots.

Status	Criteria
pass	< 5% of 15-min slots empty
warn	5-15% empty
fail	> 15% empty

Check 8: per_ticker_stats

What: Per-ticker row counts and time coverage. Reports top/bottom tickers by volume.

Status	Criteria
pass	Always passes (informational)

Check 9: parquet_vs_jsonl (accountability)

What: Verifies every JSONL line is accounted for in parquet. The zero-tolerance accountability check.

Sources (in priority order):

1. parquet-gen manifest (parquet-manifest.json): preferred, scoped to what was actually processed
2. Archiver manifest (manifest.json): fallback for pre-v2.0.0 data

Comparison logic:

Data types: parquet_rows == jsonl_lines (exact match), or parquet_rows >= jsonl_lines for fanout types
Control types: counted and reported, NOT converted to parquet
Unknown types: any unrecognized type = FAIL (needs new schema)

Status	Criteria
pass	All data types accounted for
warn	Minor discrepancies or coverage < 100%
fail	Missing data or unknown types

Pipeline stats surfaced: lines_json_error, lines_type_unknown, lines_no_schema, parse_batch_dropped

u64 underflow guard: Values > 2^63 in parse_batch_dropped are treated as 0 (known Rust underflow from fanout).
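
The guard is easy to picture with a wrapped subtraction: when an unsigned 64-bit counter goes below zero it wraps to a value just under 2^64, which is far above 2^63. A minimal sketch; `sanitize_dropped` is a hypothetical name, and the wrapped value below is simulated in Python rather than taken from a real manifest.

```python
U64_UNDERFLOW_THRESHOLD = 2**63


def sanitize_dropped(value: int) -> int:
    """Treat implausibly large parse_batch_dropped values as 0."""
    return 0 if value > U64_UNDERFLOW_THRESHOLD else value


# Simulate a u64 wrapping below zero: 3 - 5 wraps to 2**64 - 2.
wrapped = (3 - 5) % 2**64
```

Here `sanitize_dropped(wrapped)` returns 0, while an ordinary count such as 12 passes through unchanged.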

Check 10: exchange_seq_gaps

What: Gaps in exchange-provided sequence numbers per file.

Grouping:

Kalshi: group by _shard_id (v1.3.0+), falls back to sid (v1.2.0), then ticker_col
Kraken Futures: group by product_id

Status	Criteria
pass	No gaps in exchange sequences
warn	Gaps present but coverage > 95%
fail	Coverage < 95%

Common failures: Kalshi pre-1.3.0 data is grouped by sid, which collides across shards, so expect false gaps on old data. Data from 1.3.0 onward is grouped by _shard_id for accurate per-shard analysis.
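
The Kalshi grouping fallback and the coverage figure can be sketched as follows. This assumes the exchange sequence lives in a column named `seq` (taken from the NATS message example, not confirmed for parquet) and that coverage is present sequence numbers over the expected span, summed across groups. Both function names are hypothetical.

```python
from collections import defaultdict


def pick_group_col(columns: set[str], ticker_col: str) -> str:
    """Prefer _shard_id, then sid, then the feed's ticker column."""
    for candidate in ("_shard_id", "sid"):
        if candidate in columns:
            return candidate
    return ticker_col


def seq_coverage_pct(rows: list[dict], group_col: str, seq_col: str = "seq") -> float:
    """Share of expected sequence numbers actually present, per group."""
    by_group: dict = defaultdict(set)
    for row in rows:
        by_group[row[group_col]].add(row[seq_col])
    expected = sum(max(s) - min(s) + 1 for s in by_group.values())
    present = sum(len(s) for s in by_group.values())
    return 100.0 * present / expected if expected else 100.0
```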

Check 11: data_coverage

What: Percentage of 15-minute time slots (96/day) that have exchange-timestamped data.

Status	Criteria
pass	>= 95% of slots have data
warn	80-95%
fail	< 80%

Feed-specific timestamp columns: Kalshi ts, Kraken time, Polymarket timestamp_ms.

Note: Polymarket coverage may be lower due to activity concentrated in US hours.
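
Slot coverage can be computed by bucketing timestamps into the day's 96 fifteen-minute slots. A minimal sketch, assuming timestamps are UTC epoch seconds (as in the Kalshi `ts` example) and that slots are aligned to UTC midnight; other feeds may need unit conversion first. The function names are hypothetical.

```python
from datetime import datetime, timezone

SLOT_SECONDS = 15 * 60
SLOTS_PER_DAY = 96


def slot_coverage_pct(epoch_seconds: list[int], day: str) -> float:
    """Percentage of 15-minute slots on `day` (YYYY-MM-DD, UTC) with data."""
    day_start = int(
        datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc).timestamp()
    )
    filled = set()
    for ts in epoch_seconds:
        offset = int(ts) - day_start
        if 0 <= offset < SLOTS_PER_DAY * SLOT_SECONDS:  # ignore other days
            filled.add(offset // SLOT_SECONDS)
    return 100.0 * len(filled) / SLOTS_PER_DAY


def classify_coverage(pct: float) -> str:
    """Apply the data_coverage thresholds from the table above."""
    if pct >= 95.0:
        return "pass"
    if pct >= 80.0:
        return "warn"
    return "fail"
```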

Check 12: connection_uptime

What: WebSocket connection uptime from Cloud Monitoring websocket_connected gauge.

Status	Criteria
pass	>= 99% uptime (min across all shards)
warn	95-99%
fail	< 95%

Known issue: Ghost shards from before connector restarts have low data point counts, dragging down the min. The check should consider only the latest shard generation.

Requires: google-cloud-monitoring package, Monitoring Viewer IAM role on the SA.

Check 13: schema_version

What: Verifies parquet files report their schema version via metadata.

Status	Criteria
pass	Schema version present and recognized
warn	Schema version missing (pre-versioning data)
fail	Unknown schema version

Fanout Rules

Some message types produce multiple parquet rows per JSONL line (1:N fan-out):

Feed	Type	Reason
kraken	ticker	Batch array: one WS message contains multiple symbol updates
kraken	trade	Batch array: one WS message contains multiple trades
polymarket	price_change	Array-wrapped messages may contain multiple updates

For fanout types: parquet_rows >= jsonl_lines (not exact match). For non-fanout types: parquet_rows == jsonl_lines (exact match).
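
The comparison rule can be written as a lookup against the fanout table. A minimal sketch; `FANOUT_TYPES` mirrors the table above, while `rows_accounted` is a hypothetical helper rather than the function used in data/dq.py.

```python
# (feed, type) pairs that may produce several parquet rows per JSONL line.
FANOUT_TYPES = {
    ("kraken", "ticker"),
    ("kraken", "trade"),
    ("polymarket", "price_change"),
}


def rows_accounted(feed: str, msg_type: str, jsonl_lines: int, parquet_rows: int) -> bool:
    """True when parquet row counts are consistent with JSONL line counts."""
    if (feed, msg_type) in FANOUT_TYPES:
        return parquet_rows >= jsonl_lines  # 1:N fan-out, so at least one row per line
    return parquet_rows == jsonl_lines      # 1:1, so counts must match exactly
```

Note that the fanout rule only bounds the count from below, so it cannot detect a dropped line when other lines in the same file fan out to several rows.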

Per-Feed Expected Message Types

Feed	Data Types (have parquet schemas)	Control Types (skipped)
kalshi	ticker, trade, market_lifecycle_v2	subscribed, ok, error
kraken	ticker, trade	heartbeat, status
kraken-futures	ticker, trade	control
polymarket	book, last_trade_price, price_change, best_bid_ask	new_market, market_resolved
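
The table above maps directly onto a lookup that sorts each message type into data, control, or unknown, where unknown is what Check 9 reports as a failure. A minimal sketch; `FEED_TYPES` copies the table, and `classify_type` is a hypothetical helper.

```python
FEED_TYPES = {
    "kalshi": {
        "data": {"ticker", "trade", "market_lifecycle_v2"},
        "control": {"subscribed", "ok", "error"},
    },
    "kraken": {
        "data": {"ticker", "trade"},
        "control": {"heartbeat", "status"},
    },
    "kraken-futures": {
        "data": {"ticker", "trade"},
        "control": {"control"},
    },
    "polymarket": {
        "data": {"book", "last_trade_price", "price_change", "best_bid_ask"},
        "control": {"new_market", "market_resolved"},
    },
}


def classify_type(feed: str, msg_type: str) -> str:
    """Return 'data', 'control', or 'unknown' for a feed's message type."""
    kinds = FEED_TYPES.get(feed, {})
    for kind in ("data", "control"):
        if msg_type in kinds.get(kind, set()):
            return kind
    return "unknown"
```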

Trade Message Schemas

Kalshi NATS message (post-connector injection)

{
  "type": "trade",
  "sid": 2,
  "seq": 5724,
  "msg": {
    "trade_id": "uuid",
    "market_ticker": "KXBTCD-26FEB0317-T76999.99",
    "yes_price": 17,
    "count": 130,
    "taker_side": "no",
    "ts": 1770153448
  },
  "_shard_id": 3
}

Kalshi REST API trade

{
  "trade_id": "uuid",
  "ticker": "KXBTCD-26FEB0317-T76999.99",
  "yes_price": 17,
  "count": 130,
  "taker_side": "no",
  "created_time": "2026-02-03T21:17:28.18002Z"
}
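
The two shapes describe the same trade: `msg.market_ticker` corresponds to `ticker`, and `msg.ts` (epoch seconds) corresponds to `created_time` truncated to whole seconds. The sketch below illustrates that field correspondence only; it is not a conversion the pipeline is known to perform, and `nats_trade_to_rest_shape` is a hypothetical name.

```python
from datetime import datetime, timezone


def nats_trade_to_rest_shape(message: dict) -> dict:
    """Reshape a Kalshi NATS trade message into the REST trade layout."""
    msg = message["msg"]
    created = datetime.fromtimestamp(msg["ts"], tz=timezone.utc)
    return {
        "trade_id": msg["trade_id"],
        "ticker": msg["market_ticker"],
        "yes_price": msg["yes_price"],
        "count": msg["count"],
        "taker_side": msg["taker_side"],
        # ts has one-second resolution, so sub-second digits cannot be recovered.
        "created_time": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Applied to the NATS example above, the `ts` value 1770153448 becomes `2026-02-03T21:17:28Z`, which agrees with the REST `created_time` up to the fractional seconds.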

Adding a New Check

1. Add def check_<name>(con, base, ...) -> dict to data/dq.py
2. Return dict with at minimum: {"check": "<name>", "status": "pass|warn|fail|skip", "detail": "..."}
3. Add to check_fns list in DQRunner.run()
4. Update this skill with the check's criteria and common failures
5. Deploy: tag dq-v*, update dq-daily.yaml manifest (see ssmd-deploy skill)
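
As a starting point, a new check might look like the skeleton below. It only follows the signature and return contract described in the steps above. The check name `example`, the skip condition, and the `is_valid_result` helper are all hypothetical, and the types of `con` and `base` are not specified here, so the body does not query anything.

```python
VALID_STATUSES = {"pass", "warn", "fail", "skip"}


def check_example(con, base, **kwargs) -> dict:
    """Hypothetical check that skips when no connection is supplied."""
    if con is None:
        return {"check": "example", "status": "skip", "detail": "no connection provided"}
    # A real check would inspect the parquet files under `base` via `con` here.
    return {"check": "example", "status": "pass", "detail": f"inspected {base}"}


def is_valid_result(result: dict) -> bool:
    """Verify the minimum keys and an allowed status value."""
    return {"check", "status", "detail"} <= result.keys() and result["status"] in VALID_STATUSES
```

Validating the result shape in a unit test before registering the function in `check_fns` may catch a misspelled status early.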

GCS Data Layout

gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/
  manifest.json           # archiver manifest (records_by_type, files list)
  parquet-manifest.json   # parquet-gen manifest (parse_batch_input, records_written)
  ticker_000000.parquet   # parquet files per type
  trade_000000.parquet
  ...
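
The layout above can be turned into object URIs with plain string formatting. A minimal sketch; the helper names are hypothetical, and the argument values in the usage note are placeholders rather than real prefixes or streams.

```python
def day_prefix(prefix: str, feed: str, stream: str, date: str) -> str:
    """Build the per-day directory URI following the layout above."""
    return f"gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/"


def manifest_uris(prefix: str, feed: str, stream: str, date: str) -> dict:
    """Locate both manifests for one feed/stream/date."""
    base = day_prefix(prefix, feed, stream, date)
    return {
        "archiver": base + "manifest.json",
        "parquet_gen": base + "parquet-manifest.json",
    }
```

For example, `manifest_uris("PREFIX", "kalshi", "STREAM", "2026-02-03")["parquet_gen"]` yields `gs://ssmd-data/PREFIX/kalshi/STREAM/2026-02-03/parquet-manifest.json`.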
