From skillry-data-ml-ai-engineering
Use when you need to add or review data quality validation — schema checks, null/duplicate/range assertions, Great Expectations or pandera suites, data contracts, freshness, and failure routing.
Slash command
/skillry-data-ml-ai-engineering:312-data-quality-validation
Add or review automated data quality checks that catch bad data before it reaches consumers. Cover schema conformance (column presence, types, nullability), row-level assertions (null rate, uniqueness, accepted ranges, referential integrity), freshness/timeliness, volume/anomaly checks, and data contracts between producer and consumer. Establish where checks run (in-pipeline gate vs out-of-band monitor), how failures are routed (block, quarantine, or warn), and what evidence is recorded. The goal is that no silently-corrupt dataset is published.
# Find validation frameworks already in use
grep -rn "great_expectations\|import pandera\|from pandera\|soda\|dbt test\|expect_\|pa.Column\|DataFrameSchema" . --include="*.py" --include="*.yml" --include="*.yaml" | head -30
find . -name "*.yml" -path "*expectations*" -o -name "schema*.py" 2>/dev/null | head -20
For every consumed table, list columns with: type, nullable?, unique?, accepted range/set, and foreign-key target. This becomes the contract.
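The inventory above can be kept as data rather than prose, so the same contract can drive checks and reviews. The following is a minimal sketch, not a standard format: `ColumnContract`, `ORDERS_CONTRACT`, the helper functions, and the `customers.customer_id` foreign-key target are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ColumnContract:
    """One row of the per-table inventory (hypothetical structure)."""
    name: str
    dtype: str
    nullable: bool = False
    unique: bool = False
    accepted_range: Optional[Tuple[float, float]] = None  # inclusive bounds
    accepted_set: Optional[Sequence[str]] = None
    foreign_key: Optional[str] = None  # "table.column" of the referenced key


# Example contract for an orders table; the foreign-key target is an assumption.
ORDERS_CONTRACT = [
    ColumnContract("order_id", "int", unique=True),
    ColumnContract("customer_id", "int", foreign_key="customers.customer_id"),
    ColumnContract("amount", "float", accepted_range=(0, 1_000_000)),
    ColumnContract("status", "str",
                   accepted_set=("new", "paid", "shipped", "cancelled")),
    ColumnContract("updated_at", "datetime64[ns]"),
]


def undeclared_columns(contract, actual_columns):
    """Columns present in the data but absent from the contract (schema drift)."""
    declared = {c.name for c in contract}
    return sorted(set(actual_columns) - declared)


def missing_columns(contract, actual_columns):
    """Contract columns that the data does not provide, in contract order."""
    actual = set(actual_columns)
    return [c.name for c in contract if c.name not in actual]
```

Keeping the contract in one place means the producer and the consumer review the same list, and a drift check reduces to comparing it against the columns actually delivered.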
# null / completeness checks
grep -rn "isnull\|is_null\|notNull\|expect_column_values_to_not_be_null\|nullable=False" . --include="*.py" | head -20
# uniqueness / duplicate checks
grep -rn "duplicated\|drop_duplicates\|unique=True\|expect_column_values_to_be_unique\|COUNT(DISTINCT" . --include="*.py" --include="*.sql" | head -20
# range / set membership
grep -rn "expect_column_values_to_be_between\|isin\|ge=\|le=\|Check\." . --include="*.py" | head -20
# freshness
grep -rn "freshness\|max(updated_at\|MAX(\|expect_.*recent\|loaded_at" . --include="*.py" --include="*.sql" --include="*.yml" | head -20
Confirm each suite is wired as one of: hard gate (fail the pipeline, do not publish), quarantine (route bad rows to a holding area, publish the good ones), or warn-only monitor. The mode must match how critical the consumer is.
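The three routing modes can be sketched as a single function. This is an illustrative outline using only the standard library; `Mode`, `DataQualityError`, and `route` are hypothetical names, and a real pipeline would write the quarantined rows and warnings somewhere durable.

```python
from enum import Enum


class Mode(Enum):
    GATE = "gate"              # fail the pipeline, publish nothing
    QUARANTINE = "quarantine"  # hold bad rows, publish the good ones
    WARN = "warn"              # publish everything, record the failures


class DataQualityError(Exception):
    """Raised by a hard gate so the pipeline stops before publishing."""


def route(rows, is_valid, mode):
    """Apply a failure mode to a list of rows.

    Returns a tuple (to_publish, quarantined, warnings).
    """
    good, bad = [], []
    for row in rows:
        (good if is_valid(row) else bad).append(row)

    if not bad:
        return good, [], []
    summary = f"{len(bad)} of {len(rows)} rows failed validation"
    if mode is Mode.GATE:
        raise DataQualityError(summary + "; not publishing")
    if mode is Mode.QUARANTINE:
        return good, bad, []
    return list(rows), [], [summary]
```

Making the mode an explicit argument keeps the choice visible in review: a critical consumer should be wired with `Mode.GATE`, and a warn-only suite is then a deliberate decision.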
Confirm there is an assertion that the latest partition is recent enough and that row volume is within an expected band (to catch silent partial loads or source outages).
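The freshness and volume assertions can also be expressed as small pure functions that are easy to unit test. This is a sketch under assumptions: the function names are hypothetical, and the 6-hour and 50%-200% thresholds are taken from the SQL checks in this document, not from any standard.

```python
from datetime import datetime, timedelta, timezone


def is_stale(latest, now, max_age=timedelta(hours=6)):
    """True when there is no data at all or the newest row is older than max_age."""
    return latest is None or now - latest > max_age


def volume_anomaly(today_count, prior_daily_counts, low=0.5, high=2.0):
    """True when today's count is outside [low, high] times the prior-day average.

    An empty baseline is treated as an anomaly so the check fails closed.
    """
    if not prior_daily_counts:
        return True
    baseline = sum(prior_daily_counts) / len(prior_daily_counts)
    return today_count < low * baseline or today_count > high * baseline


# Usage with made-up numbers: a 7-hour-old partition and a day at 40% of normal volume.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = is_stale(now - timedelta(hours=7), now)
anomalous = volume_anomaly(40, [100] * 7)
```

Treating "no rows" and "no baseline" as failures matters because a source outage often shows up as an empty result, which a naive comparison would silently pass.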
pandera schema as a load-time gate (Python):
import pandera as pa
from pandera import Column, Check, DataFrameSchema

orders_schema = DataFrameSchema(
    {
        "order_id": Column(int, nullable=False, unique=True),
        "customer_id": Column(int, nullable=False),
        "amount": Column(float, Check.in_range(0, 1_000_000), nullable=False),
        "status": Column(str, Check.isin(["new", "paid", "shipped", "cancelled"])),
        "updated_at": Column("datetime64[ns]", nullable=False),
    },
    strict=True,  # reject unexpected columns (schema drift)
    coerce=True,  # cast columns to the declared dtypes before checking
)


def validate_or_raise(df):
    # With lazy=True, pandera collects all failures and raises
    # pandera.errors.SchemaErrors listing every failing column/row; fail closed.
    return orders_schema.validate(df, lazy=True)
Great Expectations expectation suite (Python):
import great_expectations as gx

# Fluent API as found in GX 0.16 to 0.18, where read_dataframe returns a Validator.
# GX 1.x changed this entry point, so pin the version or adapt the calls.
ctx = gx.get_context()
batch = ctx.sources.pandas_default.read_dataframe(df)  # df: the pandas DataFrame under test

batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_unique("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)
batch.expect_column_values_to_be_in_set("status", ["new", "paid", "shipped", "cancelled"])

result = batch.validate()
assert result.success, "Data quality gate failed; not publishing."
Freshness and volume checks in SQL (run as a gate):
-- Freshness: newest row must be within the last 6 hours (an empty table counts as stale).
SELECT CASE WHEN MAX(updated_at) IS NULL
              OR MAX(updated_at) < NOW() - INTERVAL '6 hours'
            THEN 1 ELSE 0 END AS is_stale
FROM analytics.orders;

-- Volume: today's count must be within 50%-200% of the average of the previous 7 days.
WITH today AS (SELECT COUNT(*) AS c FROM analytics.orders WHERE load_date = CURRENT_DATE),
     base AS (SELECT AVG(daily) AS a FROM (
         SELECT COUNT(*) AS daily FROM analytics.orders
         WHERE load_date >= CURRENT_DATE - 7
           AND load_date < CURRENT_DATE  -- exclude today so a partial load cannot skew its own baseline
         GROUP BY load_date) x)
SELECT CASE WHEN today.c < 0.5*base.a OR today.c > 2.0*base.a
            THEN 1 ELSE 0 END AS volume_anomaly
FROM today, base;
No strict/unexpected-column check, so silent schema drift adds or renames columns unnoticed.
Produce a structured report with:
file:line | gap | risk | concrete fix.
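The `file:line | gap | risk | concrete fix` row format can be produced consistently with a small helper. This is only a sketch: `Finding` and `render_report` are hypothetical names, and the example path and line number in the comment are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One report row; field names follow the row format described above."""
    file: str
    line: int
    gap: str
    risk: str
    fix: str

    def as_row(self):
        return f"{self.file}:{self.line} | {self.gap} | {self.risk} | {self.fix}"


def render_report(findings):
    """Render findings sorted by file, then line, so reports are stable across runs."""
    ordered = sorted(findings, key=lambda f: (f.file, f.line))
    return "\n".join(f.as_row() for f in ordered)


# Hypothetical finding; the path and line number are made up.
example = Finding(
    file="pipelines/orders.py",
    line=42,
    gap="no strict column check",
    risk="schema drift goes unnoticed",
    fix="set strict=True on the schema",
)
```

A stable sort order makes successive reports easy to diff, which helps when tracking whether gaps were actually closed.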
npx claudepluginhub fluxonlab/skillry --plugin skillry-data-ml-ai-engineering