Reference catalog of data-quality conventions - when to choose dbt-tests vs Great Expectations vs Soda, column-level vs table-level coverage, severity tiering, SLA and freshness conventions, and common anti-patterns to avoid. Use when designing coverage for a new data product or auditing an existing one.
Slash command: /qa-data-quality:data-quality-conventions
A reference catalog for **how** to design data-quality coverage. Pairs
with the engine-specific skills in this plugin
(dbt-testing,
great-expectations,
soda-checks) - those tell you the how
of running checks; this tells you which checks and where.
Use this matrix to pick an engine before authoring any checks. Mixing
engines is fine in a large platform - the data-quality-gate
skill exists precisely to reconcile their outputs.
| Use case | Preferred engine | Why |
|---|---|---|
| Repo is a dbt project; assertions live next to models | dbt-tests | Tests run as part of dbt build, share the project's adapter, and live in the same schema.yml as the column docs. |
| Python ELT, programmatic suite generation, runtime data connectors | Great Expectations | First-class Pandas / SQL / Spark batches; suites composable in Python. |
| YAML-only checks against a SQL warehouse, cross-team observability via Soda Cloud | Soda | SodaCL is a focused DSL; no Python or dbt required. |
| Mix of warehouse + non-warehouse sources | dbt + GX or Soda + GX | Use dbt or Soda for warehouse coverage; GX for non-warehouse (file uploads, Spark jobs, ad-hoc Pandas). |
| Brand-new project, no preference | dbt-tests if you already have dbt; otherwise Soda for YAML simplicity | Lowest cognitive overhead. |
Anti-pattern: running all three engines on the same dataset for the same checks "for redundancy" - duplicate checks shift the failure cost to humans triaging three CI signals for the same root cause.
Authoring tip: start at the column level for the keys and required fields, then add table-level invariants. A reasonable starter set for a new data product:
| Layer | Check | Justification |
|---|---|---|
| Column (PK) | not_null, unique | Catches the largest classes of ingestion bugs. |
| Column (FK) | relationships to the referenced table | Catches out-of-order loads. |
| Column (enum) | accepted_values for low-cardinality strings | Catches new states added upstream without coordination. |
| Column (numeric) | range / between for business-bounded columns | Catches arithmetic errors and unit mismatches. |
| Column (date) | not_null on required timestamps | Catches dropped fields. |
| Table | row_count > 0 (and an upper bound if known) | Catches catastrophic ingestion drop or runaway duplicate. |
| Table | freshness on the load-time timestamp | Catches "pipeline didn't run" - the most common silent failure. |
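The starter set above can be sketched in plain Python to show what each check asserts. This is a minimal, engine-agnostic illustration, not how dbt, GX, or Soda implement these checks; the `orders` rows, the column names, and the allowed status values are all hypothetical.

```python
from typing import Any, Optional

Row = dict[str, Any]

def not_null(rows: list[Row], column: str) -> bool:
    """True when no row has a missing value in the column."""
    return all(r.get(column) is not None for r in rows)

def unique(rows: list[Row], column: str) -> bool:
    """True when no value in the column repeats."""
    values = [r.get(column) for r in rows]
    return len(values) == len(set(values))

def accepted_values(rows: list[Row], column: str, allowed: set) -> bool:
    """True when every value belongs to the allowed set."""
    return all(r.get(column) in allowed for r in rows)

def between(rows: list[Row], column: str, low: float, high: float) -> bool:
    """True when every non-null value lies in [low, high]."""
    return all(low <= r[column] <= high for r in rows if r.get(column) is not None)

def row_count_between(rows: list[Row], low: int, high: Optional[int] = None) -> bool:
    """True when the row count is at least low and, if given, at most high."""
    n = len(rows)
    return n >= low and (high is None or n <= high)

def run_starter_checks(rows: list[Row]) -> dict[str, bool]:
    # Check names describe the assertion, so a failure is self-explanatory.
    return {
        "not_null_orders_order_id": not_null(rows, "order_id"),
        "unique_orders_order_id": unique(rows, "order_id"),
        "accepted_values_orders_status": accepted_values(
            rows, "status", {"placed", "shipped", "returned"}  # hypothetical enum
        ),
        "between_orders_discount_pct": between(rows, "discount_pct", 0, 100),
        "row_count_orders": row_count_between(rows, 1),
    }

# Hypothetical sample: a duplicate key, an unknown status, an out-of-range discount.
orders = [
    {"order_id": 1, "status": "placed", "discount_pct": 10},
    {"order_id": 2, "status": "shipped", "discount_pct": 0},
    {"order_id": 2, "status": "lost", "discount_pct": 250},
]
results = run_starter_checks(orders)
failed = sorted(name for name, ok in results.items() if not ok)
print(failed)
```

Each check returns a single boolean for a single invariant, which keeps a failure attributable to one field.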
For numeric columns, prefer business-bounded ranges (e.g. discount 0-100 %) over distribution-bounded ranges (e.g. mean ± 3σ). Statistical ranges drift with seasonality and produce false positives at scale; treat distribution monitoring as a separate concern from data-quality assertions.
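To make the difference concrete, the sketch below derives a mean ± 3σ range from one hypothetical week of discount percentages and applies it to a second, seasonal week. The numbers are invented for illustration, and `three_sigma_bounds` is a hypothetical helper rather than any engine's API.

```python
import statistics

def three_sigma_bounds(values: list[float]) -> tuple[float, float]:
    """Distribution-bounded range: mean +/- 3 population standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return mean - 3 * sd, mean + 3 * sd

# Business-bounded range: a discount percentage is valid anywhere in [0, 100].
BUSINESS_LOW, BUSINESS_HIGH = 0.0, 100.0

# Hypothetical data: an ordinary week, then a sale week with larger discounts.
quiet_week = [5.0, 10.0, 10.0, 15.0, 10.0]
sale_week = [40.0, 50.0, 60.0, 50.0, 50.0]

stat_low, stat_high = three_sigma_bounds(quiet_week)

# Values the statistical range would flag, although every one is a valid discount.
flagged_statistical = [v for v in sale_week if not stat_low <= v <= stat_high]

# Values the business range would flag.
flagged_business = [v for v in sale_week if not BUSINESS_LOW <= v <= BUSINESS_HIGH]

print(len(flagged_statistical), len(flagged_business))
```

In this toy data the range learned from the quiet week flags every sale-week value, while the business bound flags none of them.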
Three tiers cover most needs:
| Tier | Behavior on failure | Use for |
|---|---|---|
| error | Block the pipeline / CI | Invariants the business depends on (PK uniqueness, FK integrity, required fields). |
| warn | Surface in the report, do not block | Distribution shifts, soft constraints, new unfamiliar checks during ramp-up. |
| info | Log only, never alert | Coverage telemetry (e.g. "70 % of customer rows had email populated"). |
In dbt, severity comes from a severity: config block; in GX, from
suite-level metadata; in SodaCL, from warn: / fail: blocks for
schema and from the alert configuration syntax for other checks. Each
engine's specifics are in its SKILL.md.
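Independent of engine, the three tiers amount to a routing rule over failed checks. The following is a minimal sketch of that rule, assuming a hypothetical `CheckResult` record; it is not the data-quality-gate skill's implementation, and the check names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"  # block the pipeline / CI
    WARN = "warn"    # surface in the report, do not block
    INFO = "info"    # log only, never alert

@dataclass(frozen=True)
class CheckResult:
    name: str
    severity: Severity
    passed: bool

def evaluate_gate(results: list[CheckResult]) -> dict[str, list[str]]:
    """Route failed checks by tier; passing checks are ignored."""
    failures = [r for r in results if not r.passed]
    return {
        "blocking": [r.name for r in failures if r.severity is Severity.ERROR],
        "reported": [r.name for r in failures if r.severity is Severity.WARN],
        "logged": [r.name for r in failures if r.severity is Severity.INFO],
    }

# Hypothetical run: one failure per tier plus one passing error-tier check.
run = [
    CheckResult("unique_orders_order_id", Severity.ERROR, passed=False),
    CheckResult("between_orders_discount_pct", Severity.WARN, passed=False),
    CheckResult("email_populated_ratio", Severity.INFO, passed=False),
    CheckResult("not_null_orders_order_id", Severity.ERROR, passed=True),
]
outcome = evaluate_gate(run)
should_block = bool(outcome["blocking"])  # only error-tier failures block
print(should_block, outcome)
```

Only the `blocking` list decides whether the pipeline halts; the other two lists feed the report and the log.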
Anti-pattern: every check at error. A pipeline that blocks on a
single distribution drift becomes "the check that always fails" within
a quarter; the team eventually disables the gate entirely. Reserve
error for invariants the business will actually halt to fix.
A freshness check is the highest-leverage assertion in most pipelines - it catches the "pipeline didn't run" failure mode that no row-level check can detect (because there are no rows to check).
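A freshness check compares the newest load-time timestamp against a maximum allowed age, and it must also fail when the table has no rows at all. The sketch below assumes a hypothetical `is_fresh` helper and a daily pipeline with a two-day threshold; in practice each engine provides its own freshness syntax.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_fresh(latest_loaded_at: Optional[datetime], max_age: timedelta, now: datetime) -> bool:
    """True when the newest load timestamp is younger than max_age.

    latest_loaded_at is None when the table is empty, which counts as stale:
    this is the case row-level checks cannot see.
    """
    if latest_loaded_at is None:
        return False
    return now - latest_loaded_at < max_age

# Fixed clock so the example is deterministic.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
DAILY_MAX_AGE = timedelta(days=2)  # assumed threshold for a daily pipeline

late_but_ok = is_fresh(now - timedelta(hours=30), DAILY_MAX_AGE, now)  # one late run
stuck = is_fresh(now - timedelta(days=3), DAILY_MAX_AGE, now)          # pipeline stopped
empty = is_fresh(None, DAILY_MAX_AGE, now)                             # no rows loaded

print(late_but_ok, stuck, empty)
```

Passing `now` explicitly rather than reading the system clock inside the function keeps the check testable.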
Conventions:
- A daily pipeline gets freshness < 2d; an hourly pipeline gets < 2h. This avoids false alarms from a single late run while still catching a stuck pipeline.
- Freshness checks are error-severity by convention - a stuck pipeline blocks every downstream business decision, so a noisy alert is the right behavior.
- Name each check after what it asserts: not_null_orders_email (dbt-style) or missing_count(email) = 0 (SodaCL) is self-documenting; assertion_42 is not.
- Record an owner for each check. dbt: a meta: block under the column or model with owner: @team-handle. GX: suite-level metadata. SodaCL: the dataset's meta: block. The data-anomaly-triager agent reads these to route failures.

| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Hard-coded distributions ("revenue between $X and $Y") | Drifts with seasonality; ages out within months | Bound with a moving window or replace with a separate distribution-monitoring tool. |
| not_null on every column | Some columns are legitimately nullable; floods the failure list with non-actionable noise | Annotate nullable columns explicitly in schema.yml and skip the test. |
| One mega-check that asserts an entire row at once | Hard to triage; a single failure does not tell you which field broke | Split into one check per invariant; aggregate at the gate level via the data-quality-gate. |
| Reaching into upstream raw data inside a check | Couples the assertion to source schema; breaks on every upstream rename | Test the curated model output, not the raw source. |
| Same check authored in dbt and GX and Soda | Three failure signals for one root cause; triage cost triples | Pick one engine per check; use the gate skill to unify cross-engine coverage. |
| Severity error on a freshly added check | First false positive disables the gate by social convention | Land new checks at warn for two cycles, then promote once they're seen to be stable. |
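The last fix in the table, landing a new check at warn and promoting it later, can be expressed as a small rule. This sketch assumes that "stable" means the check passed in each of the two most recent cycles; that reading, the `next_severity` helper, and its parameters are assumptions made for illustration. It applies only to checks that are meant to end up at error, not to checks that stay at warn permanently.

```python
def next_severity(current: str, recent_runs: list[bool], stable_cycles: int = 2) -> str:
    """Promote a warn-tier check to error after enough consecutive passing cycles.

    recent_runs holds pass/fail outcomes, oldest first. Checks at any other
    tier are returned unchanged.
    """
    if current != "warn":
        return current
    enough_history = len(recent_runs) >= stable_cycles
    if enough_history and all(recent_runs[-stable_cycles:]):
        return "error"
    return "warn"

print(next_severity("warn", [True, True]))    # two clean cycles
print(next_severity("warn", [True, False]))   # latest cycle failed
print(next_severity("warn", [True]))          # not enough history yet
```

A failed cycle keeps the check at warn, so a false positive during ramp-up never reaches the blocking tier.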
A check is healthy when it fails roughly once per quarter on a real issue. Two failure modes warrant retirement: