From qa-data-quality
Authors and runs SodaCL (Soda Checks Language) checks against SQL warehouses (Snowflake, BigQuery, Postgres, Redshift, etc.) via `soda scan`, configures scan profiles in configuration.yml, and gates CI on scan exit code. Use when the user works with Soda Core / Soda Cloud or needs YAML-driven warehouse data quality.
Slash command: `/qa-data-quality:soda-checks`
SodaCL (Soda Checks Language) is a YAML-based domain-specific language for
data reliability. A typical Soda project ships two YAML files -
configuration.yml (data source connection) and checks.yml (assertions) - and runs them through the soda scan CLI (sodacl-overview,
sodacl-quickstart).
This skill covers SodaCL authoring, scan configuration, running, and CI integration. Use it when you want YAML-only assertions on warehouse tables without writing Python (in contrast to Great Expectations, which is Python-first) or compiling a dbt project.
- configuration.yml + checks.yml (the canonical Soda filenames per sodacl-quickstart).
- checks for <table>:, missing_count, duplicate_count, freshness, or invalid_count - SodaCL check types per sodacl-overview.

checks.yml is a list of checks for <dataset>: blocks. Each block lists
assertions in SodaCL syntax (sodacl-overview):

    # checks.yml
    checks for orders:
      - row_count between 1 and 10000000
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - invalid_percent(email) < 1 %:
          valid format: email
      - freshness(updated_at) < 1d
    checks for customers:
      - row_count > 0
      - missing_count(customer_id) = 0
      - schema:
          warn:
            when required column missing: [created_at]
          fail:
            when forbidden column present: [pii_ssn, pii_dob]

The check vocabulary in SodaCL (sodacl-overview):
| Check | Example |
|---|---|
| `row_count` | `row_count between 10 and 1000` |
| `missing_count(c)` | `missing_count(birth_date) = 0` |
| `missing_percent(c)` | `missing_percent(email) < 1 %` |
| `duplicate_count(c)` | `duplicate_count(phone) = 0` |
| `invalid_count(c)` | `invalid_count(phone) = 0` (with `valid format`) |
| `freshness(col)` | `freshness(start_date) < 1d` |
| Aggregates | `avg(safety_stock_level) > 50` |
| `schema` | required / forbidden columns (warn/fail blocks) |
Thresholds use comparison operators inline with the check
(< 1%, between 10 and 1000, = 0). The schema check supports
explicit warn: and fail: alert configurations to differentiate
severity (sodacl-overview):

    checks for transactions:
      - schema:
          warn:
            when required column missing: [updated_at]
          fail:
            when forbidden column present: [pii*]

For non-schema checks, alert configurations nest warn: and fail: keys
under the check, each with a when condition (for example,
warn: when > 0). See Soda's optional-config docs at
https://docs.soda.io/soda-v3/sodacl-reference/optional-config.md for
the full grammar.
configuration.yml defines one or more data sources and (optionally)
Soda Cloud credentials. Use environment-variable substitution for
secrets - Soda expands ${ ENV_VAR } at scan time
(quick-start-dev):

    # configuration.yml
    data_source warehouse:
      type: snowflake
      username: ${ SNOWFLAKE_USER }
      password: ${ SNOWFLAKE_PASS }
      account: ${ SNOWFLAKE_ACCOUNT }
      database: analytics
      warehouse: compute_wh
      role: analyst
      schema: public
    soda_cloud:
      host: cloud.us.soda.io
      api_key_id: ${ SODA_CLOUD_API_KEY }
      api_key_secret: ${ SODA_CLOUD_API_SECRET }

The soda_cloud: block is optional - omit it for a fully local scan.
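
Because the connection values come from the environment, a scan started with an unset variable tends to fail late, with a connection error rather than a clear message. A minimal pre-flight sketch, assuming the variable names used in the configuration.yml above (`missing_env_vars` is a hypothetical helper name, not part of Soda):

```shell
# Print the name of every listed variable that is unset or empty in the
# exported environment (what a child process such as soda would see).
# Returns 1 if at least one variable is missing, 0 otherwise.
# NOTE: missing_env_vars is an illustrative helper, not a Soda command.
missing_env_vars() {
  missing=0
  for name in "$@"; do
    # printenv prints the value of an exported variable, or nothing if unset.
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to call `missing_env_vars SNOWFLAKE_USER SNOWFLAKE_PASS SNOWFLAKE_ACCOUNT` before the scan and stop the job if it returns non-zero.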
Canonical CLI invocation (sodacl-quickstart):

    soda scan -d warehouse -c configuration.yml checks.yml

Where:

- `-d warehouse` matches the data_source <name>: key in configuration.yml.
- `-c configuration.yml` points at the connection file.

Multiple checks files run in one scan:

    soda scan -d warehouse -c configuration.yml \
      checks/orders.yml checks/customers.yml checks/transactions.yml

Each scan produces a checks pass/fail summary in stdout and (when
soda_cloud: is configured) reports to Soda Cloud.
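
When the checks directory grows, listing every file by hand gets brittle. The sketch below builds the same multi-file invocation from a directory of `*.yml` files. It assumes the `warehouse` data source and `configuration.yml` from the examples above; `scan_command_for_dir` is a hypothetical helper name, and it only prints the command so the result can be inspected first.

```shell
# Print a `soda scan` command covering every *.yml file in a directory.
# Uses only the -d and -c flags shown above; the data source name
# "warehouse" and the file configuration.yml are taken from the examples.
# NOTE: scan_command_for_dir is an illustrative helper, not a Soda command.
scan_command_for_dir() {
  dir=$1
  set -- soda scan -d warehouse -c configuration.yml
  count=0
  for f in "$dir"/*.yml; do
    # When nothing matches, the glob stays literal; skip that case.
    [ -e "$f" ] || continue
    set -- "$@" "$f"
    count=$((count + 1))
  done
  if [ "$count" -eq 0 ]; then
    echo "no checks files found in $dir" >&2
    return 1
  fi
  # Print only; replace this line with "$@" to execute the command.
  printf '%s\n' "$*"
}
```

Calling `scan_command_for_dir checks` prints the command for every checks file in `checks/`, in the shell's glob order.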
A failing scan exits non-zero and prints a per-check pass/fail line. For machine consumption, the Soda GitHub Action converts scan results to a markdown table and posts the findings as a PR comment (quick-start-dev). For other CI systems, parse stdout with the filtering snippet below.
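
    # Capture the scan output; pipefail keeps soda's exit code from being masked by tee
    set -o pipefail
    soda scan -d warehouse -c configuration.yml checks.yml | tee scan.log
    status=$?
    # Surface failing checks for the PR comment / job summary
    grep -E '^\s*FAIL' scan.log || echo "All checks passed"
    # Re-raise the scan's exit code so the job still fails
    exit "$status"
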
Note: if your team needs structured (JSON) scan output for richer downstream reporting, check the current `soda scan` flag set on https://docs.soda.io/ - flag availability evolves between major Soda Library / Soda Core versions.
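
Gating on the exit code can be made explicit with a small helper that separates warnings from hard failures. This is a sketch under one assumption that should be confirmed against the docs for the installed Soda version: that `soda scan` exits 0 when all checks pass, 1 on warnings, 2 on failed checks, and 3 or higher on runtime errors. `classify_scan_exit` and `should_block` are hypothetical names.

```shell
# Map a `soda scan` exit code to a label.
# ASSUMPTION: 0 = pass, 1 = warn, 2 = fail, 3 and 4 = runtime error.
# Verify this mapping against the Soda docs for your version.
classify_scan_exit() {
  case "$1" in
    0) echo "pass" ;;
    1) echo "warn" ;;
    2) echo "fail" ;;
    3|4) echo "error" ;;
    *) echo "unknown" ;;
  esac
}

# Illustrative policy: let pass and warn through, block everything else.
# Returns 0 (true) when the pipeline should be blocked.
should_block() {
  case "$(classify_scan_exit "$1")" in
    pass|warn) return 1 ;;
    *) return 0 ;;
  esac
}

# Example wiring (not executed here):
#   soda scan -d warehouse -c configuration.yml checks.yml
#   code=$?
#   if should_block "$code"; then exit "$code"; fi
```

Letting warnings through is a policy choice; teams that want warnings to fail the build can move `warn` into the blocking branch.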
The minimal pattern is: install the Soda library + warehouse adapter,
run soda scan, exit non-zero on failure, upload the scan log as an
artifact.

    # .github/workflows/soda-scan.yml
    name: soda
    on:
      pull_request:
      push:
        branches: [main]
    jobs:
      soda-scan:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v5
          - uses: actions/setup-python@v5
            with:
              python-version: '3.12'
          - name: Install Soda + adapter
            # Pin versions; Soda Core is the open-source track, Soda Library the commercial one.
            run: pip install 'soda-core' 'soda-core-snowflake'
          - name: Run scan
            env:
              SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
              SNOWFLAKE_PASS: ${{ secrets.SNOWFLAKE_PASS }}
              SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
            run: |
              # pipefail keeps soda's exit code from being masked by tee
              set -o pipefail
              soda scan -d warehouse -c configuration.yml checks.yml | tee scan.log
          - name: Upload scan log
            if: always()
            uses: actions/upload-artifact@v4
            with:
              name: soda-scan-log
              path: scan.log
              retention-days: 14

The if: always() on the artifact upload is required to capture the log
on a failed scan - when you most need it for triage. For a managed
GitHub-Action wrapper that auto-comments on PRs, see the
Soda GitHub Action.
configuration.yml shape, env-var
substitution, CI-flow example.