From dbt-pipeline-toolkit
Automatically profile SQL Server tables and CSV files with intelligent analysis. Detects primary key candidates, infers data types from CSV data, calculates column statistics (nulls, cardinality, data types), identifies data quality issues, and recommends appropriate dbt tests. Use when exploring source data, creating staging models, or validating data quality before transformation. Generates comprehensive profiling reports with test recommendations.
How this skill is triggered — by the user, by Claude, or both
Slash command
/dbt-pipeline-toolkit:data-profilerThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Automatically profile SQL Server tables and CSV files to understand data characteristics and guide dbt model development.
Automatically profile SQL Server tables and CSV files to understand data characteristics and guide dbt model development.
This skill provides automated data profiling with intelligent analysis to:
1 - Documentation/data-profiles/ for persistent referenceSQL Server:
SQL_USER env var (empty = Windows Auth)CSV Files:
2 - Source Files/)The skill is invoked through the Python script located in the skill's folder.
Script Location: ${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py
Important: Always use the full path to the script, not a relative path. The script is in the skill folder, not the project folder.
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table customers
Output:
1 - Documentation/data-profiles/profile_customers_TIMESTAMP.jsonpython "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table raw.epraccur
Output: Correctly splits raw.epraccur into schema raw and table epraccur for metadata queries. The full schema-qualified name is used for data queries.
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --tables raw.customers raw.orders dbo.products
Output: Profiles each table and generates individual reports
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file "2 - Source Files/dft-road-casualty-statistics-casualty-2024.csv"
Output:
1 - Documentation/data-profiles/profile_dft-road-casualty-statistics-casualty-2024_TIMESTAMP.jsonpython "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --files "2 - Source Files/dft-road-casualty-statistics-casualty-*.csv"
Output: Profiles all matching CSV files
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file data.csv --columns customer_id order_date amount
Output: Profiles only specified columns
# SQL Server table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table large_table --sample 10000
# CSV file
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file large_file.csv --sample 10000
Output: Profiles based on sample of 10,000 rows for faster analysis
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file orders.csv --quick
Output: Fast profile with row count, column count, and basic statistics only
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file customers.csv --format json
Output: Structured JSON for integration with other tools
Note: JSON format is recommended for Claude Code agent workflows to avoid Unicode encoding issues on Windows consoles
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file customers.csv --verbose
Output: Detailed logging of profiling process
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file customers.csv --output "custom_profile.json"
Default Behavior:
1 - Documentation/data-profiles/profile_customers.json)Custom Filename:
--output to specify an alternative filename inside 1 - Documentation/data-profiles/--output "baseline.csv" to name a baseline snapshotThe profiler automatically infers data types from CSV data:
Numeric Detection: Attempts pd.to_numeric() conversion
bigint (int64) or decimal (float64)Datetime Detection: Attempts pd.to_datetime() conversion
datetime2 (datetime64[ns])Boolean Detection: Identifies True/False or 0/1 patterns
bit (bool)String Fallback: Any column not matching above
nvarchar (object)| Pandas dtype | SQL Server type | Notes |
|---|---|---|
| int64 | bigint | Whole numbers |
| float64 | decimal | Decimals, scientific notation |
| datetime64[ns] | datetime2 | Dates and timestamps |
| bool | bit | Boolean values |
| object | nvarchar | Text, mixed types |
nvarchar=== Data Profile: dft-road-casualty-statistics-casualty-2024 ===
Table Statistics:
- Total Rows: 125,432
- Total Columns: 15
- Primary Key: casualty_reference (detected)
- Profile Date: 2026-01-11 14:30:22
- Source: CSV file (types inferred)
Column Profiles:
┌─────────────────────────────────────────────────────────────────────┐
│ casualty_reference │
├─────────────────────────────────────────────────────────────────────┤
│ Data Type : bigint (inferred from int64) │
│ Nulls : 0 (0.0%) │
│ Distinct Values: 125,432 (100.0%) │
│ Min Value : 202401000001 │
│ Max Value : 202401125432 │
│ ✓ PRIMARY KEY CANDIDATE │
│ │
│ Recommended Tests: │
│ - unique │
│ - not_null │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ accident_reference │
├─────────────────────────────────────────────────────────────────────┤
│ Data Type : nvarchar (inferred from object) │
│ Nulls : 0 (0.0%) │
│ Distinct Values: 98,745 (78.7%) │
│ Min Length : 8 │
│ Max Length : 127 │
│ Pattern : Valid email format (99.5%) │
│ ⚠️ Contains nulls │
│ │
│ Recommended Tests: │
│ - not_null (if required by business logic) │
│ - email format validation (custom test) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ casualty_severity │
├─────────────────────────────────────────────────────────────────────┤
│ Data Type : bigint (inferred from int64) │
│ Nulls : 0 (0.0%) │
│ Distinct Values: 3 (0.002%) │
│ Values : 1 (78,234), 2 (35,198), 3 (12,000) │
│ │
│ Recommended Tests: │
│ - not_null │
│ - accepted_values: [1, 2, 3] │
└─────────────────────────────────────────────────────────────────────┘
... (all columns profiled)
Data Quality Issues:
⚠️ age_of_casualty has 1,234 nulls (0.98%) - Handle in staging model
⚠️ casualty_home_area_type has 5,432 nulls (4.3%) - Consider if needed
⚠️ casualty_severity has only 3 distinct values - Use accepted_values test
Recommendations for Staging Model:
1. Use casualty_reference as primary key
2. Add unique + not_null tests to casualty_reference
3. Add not_null tests to: accident_reference, casualty_class, casualty_severity
4. Add accepted_values test to casualty_severity column
5. Handle nulls in age_of_casualty and casualty_home_area_type
6. Add relationship test from accident_reference to accidents table
Suggested dbt YAML:
```yaml
models:
- name: stg_dft__casualties
description: "Staging model for road casualty statistics"
columns:
- name: casualty_reference
description: "Primary key for casualties"
data_tests:
- unique
- not_null
- name: email
description: "Customer email address"
data_tests:
- not_null
- name: status
description: "Customer account status"
data_tests:
- not_null
- accepted_values:
arguments:
values: ['Active', 'Inactive', 'Pending']
Export saved to: 1 - Documentation/data-profiles/profile_customers_20260111_143022.json
## What This Skill Analyzes
### For All Columns:
- Data type and precision
- Null count and percentage
- Distinct value count and cardinality
- Min/max values (for numeric/date columns)
- Min/max length (for string columns)
- Most common values (top 5)
### For Potential Primary Keys:
- Uniqueness (100% distinct = strong PK candidate)
- Nullness (0% null = strong PK candidate)
- Cardinality (high = good PK, low = dimension attribute)
- Data type (integer/bigint = typical PK)
### For String Columns:
- Length distribution
- Pattern detection (email, phone, URL, etc.)
- Character set analysis
### For Numeric Columns:
- Range (min, max, avg)
- Distribution (percentiles)
- Outlier detection
### For Date Columns:
- Date range (earliest, latest)
- Gaps in dates
- Recency analysis
## Test Recommendations Logic
The profiler automatically recommends tests based on data patterns:
### Primary Key Tests
- **Triggers**: 100% distinct values + 0% nulls
- **Recommends**: `unique` + `not_null`
### Not Null Tests
- **Triggers**: 0% nulls + important column (PK, FK, measure)
- **Recommends**: `not_null`
### Accepted Values Tests
- **Triggers**: Low cardinality (< 10 distinct values)
- **Recommends**: `accepted_values` with list of values
### Relationships Tests
- **Triggers**: Column name ends with `_id` or `_key` + not the PK
- **Recommends**: `relationships` to related dimension
### Custom Tests
- **Email format**: Pattern matches email structure
- **Date range**: Dates in future or far past
- **Negative values**: Where business logic prohibits them
- **String length**: Exceeds expected max
## Integration with Agents
### dbt-staging-builder Agent
The primary use case - dbt-staging-builder uses data-profiler to:
1. **CSV Files**: Profile source CSV to understand structure before loading
2. **SQL Tables**: Profile raw tables after loading to SQL Server
3. Identify primary key for surrogate key generation
4. Detect nullable columns for null handling
5. Get test recommendations for schema.yml
6. Understand inferred data types for SQL casting
**Workflow (CSV source)**:
```bash
# dbt-staging-builder invokes data-profiler on CSV
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file "2 - Source Files/casualties-2024.csv"
# Reviews inferred types and profile
# Creates staging model with appropriate:
# - Type casting (CSV string → SQL types)
# - Primary key selection
# - Null handling
# - Initial tests
Workflow (SQL table):
# dbt-staging-builder invokes data-profiler on raw table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table raw_customers
# Reviews profile output
# Creates staging model with appropriate:
# - Primary key selection
# - Null handling
# - Type casting
# - Initial tests
Use during discovery phase to:
Use to design comprehensive tests:
All agents can reference previously generated profiles from the documentation folder:
Profile Location: 1 - Documentation/data-profiles/
Reading Profiles:
# Agents can read JSON profiles using Read tool
profile_path = "1 - Documentation/data-profiles/profile_customers_20260113_143022.json"
# Profile contains:
# - table_name, total_rows, total_columns
# - primary_key_candidates
# - columns[] with data_type, null_percentage, distinct_count, etc.
# - quality_issues[]
# - recommendations[]
Benefits:
Best Practice: Profile once, reference many times. Profiles are timestamped and permanent.
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table customers --pk-threshold 0.95
Default threshold: 99% distinct + 1% nulls Custom threshold: 95% distinct + 5% nulls (more lenient)
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table customers --columns customer_id email status
Output: Profiles only specified columns for faster analysis
# Profile source table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table raw_customers --output raw_profile.csv
# Profile staging model after transformation
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table stg_erp__customers --output stg_profile.csv
# Compare the two CSV files
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table customers --generate-yaml
Output: Generates ready-to-use dbt schema.yml with:
Structured JSON saved to 1 - Documentation/data-profiles/ for automation and agent integration.
Benefits:
Formatted text with boxes and symbols for easy reading in terminal.
Use Case: Quick visual inspection during development
Structured export with columns:
Structured JSON for automation and integration:
{
"table_name": "customers",
"total_rows": 125432,
"total_columns": 15,
"primary_key_candidate": "customer_id",
"profile_date": "2026-01-11T14:30:22",
"columns": [
{
"name": "customer_id",
"type": "bigint",
"nulls": 0,
"null_pct": 0.0,
"distinct": 125432,
"cardinality_pct": 100.0,
"min": 1,
"max": 125432,
"is_pk_candidate": true,
"recommended_tests": ["unique", "not_null"]
}
],
"quality_issues": [
"phone_number has 12,345 nulls (9.8%)"
],
"recommendations": [
"Use customer_id as primary key",
"Add unique + not_null tests to customer_id"
]
}
Generate ready-to-use schema.yml:
models:
- name: stg_source__customers
columns:
- name: customer_id
data_tests:
- unique
- not_null
- name: email
data_tests:
- not_null
# Profile the table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table raw_customers
# Review output and identify:
# - Primary key: customer_id
# - Nullable columns: phone_number (9.8%)
# - Categorical columns: status (3 values)
# Create staging model based on insights
# Get list of tables
python "${CLAUDE_PLUGIN_ROOT}/skills/sql-server-reader/scripts/query_sql_server.py" --list-tables
# Profile each source table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --tables customers orders products line_items
# Profile to establish baseline
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table orders --output baseline.csv
# After data cleanup
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table orders --output after_cleanup.csv
# Compare null percentages and data quality metrics
--sample to limit profiling to representative sample--columns instead of all columns--quick mode for basic statistics onlyThe profiler uses efficient SQL:
pyodbc - SQL Server ODBC driverpandas - Data manipulationnumpy - Statistical calculationssqlalchemy - Database connectionsargparse, json, pathlib, datetimepip install pyodbc pandas numpy sqlalchemy
| Feature | sql-server-reader | data-profiler |
|---|---|---|
| Query execution | ✅ Manual queries | ✅ Automated profiling |
| List tables | ✅ Yes | ❌ Use sql-server-reader |
| Table schemas | ✅ Yes | ❌ Use sql-server-reader |
| Column statistics | ❌ Manual | ✅ Automatic |
| PK detection | ❌ No | ✅ Yes |
| Test recommendations | ❌ No | ✅ Yes |
| Data quality insights | ❌ No | ✅ Yes |
| Pattern detection | ❌ No | ✅ Yes |
Recommendation: Use both tools together
business-analyst can USE data-profiler to accelerate discovery.
Manual approach requires 10+ queries per table:
-- Row count
SELECT COUNT(*) FROM table
-- Null counts per column
SELECT COUNT(*) FROM table WHERE col1 IS NULL
SELECT COUNT(*) FROM table WHERE col2 IS NULL
-- ... repeat for every column
-- Distinct values
SELECT COUNT(DISTINCT col1) FROM table
-- ... repeat for every column
data-profiler automates all of this in a single command.
Use sql-server-reader test connection:
python "${CLAUDE_PLUGIN_ROOT}/skills/sql-server-reader/scripts/query_sql_server.py" --test-connection
Verify table exists:
python "${CLAUDE_PLUGIN_ROOT}/skills/sql-server-reader/scripts/query_sql_server.py" --list-tables
Use sample for large tables:
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table large_table --sample 50000
Profile columns individually:
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table huge_table --columns id name status
Problem: On Windows, the console may fail to display Unicode characters (emoji, box-drawing) with error:
UnicodeEncodeError: 'charmap' codec can't encode characters
Solutions:
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file data.csv --format json
# PowerShell
$env:PYTHONIOENCODING="utf-8"
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file data.csv
# Command Prompt
set PYTHONIOENCODING=utf-8
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file data.csv
--output (no shell redirection needed):python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --file data.csv --format json --output profile_output.json
Note: JSON format is recommended for Claude Code agent workflows to avoid encoding issues entirely.
Profiles are automatically saved to 1 - Documentation/data-profiles/ with benefits:
Organization:
1 - Documentation/
└── data-profiles/
├── profile_customers_20260113_143022.json
├── profile_orders_20260113_143045.json
├── profile_dft-road-casualty-statistics-casualty-2019_20260113_143100.json
└── README.md # Document profiling history
# Step 1: Profile source table
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table raw_customers --verbose
# Review output
# - PK: customer_id (100% distinct, 0% null)
# - Nulls: phone_number (10%), secondary_email (76%)
# - Categorical: status (3 values: Active, Inactive, Pending)
# Step 2: Create staging model
# models/staging/erp/stg_erp__customers.sql
# Step 3: Add tests from recommendations
# models/staging/erp/_stg_erp__schema.yml
# Profile all ERP source tables
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --tables \
raw_customers \
raw_orders \
raw_products \
raw_line_items \
raw_payments
# Review all profiles to identify:
# - Primary keys for each table
# - Foreign key relationships
# - Data quality issues
# - Test requirements
# Profile to identify data quality issues
python "${CLAUDE_PLUGIN_ROOT}/skills/data-profiler/scripts/profile_data.py" --table orders --verbose
# Output shows:
# - order_total has negative values (data quality issue)
# - ship_date is null for 5% of orders
# - customer_id has values not in customers table (referential integrity issue)
# Fix issues at source or handle in staging model
Key Value: Automates what would require 10-20 manual queries, provides intelligent recommendations, and accelerates staging model development by 70%.
Guides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.
npx claudepluginhub kavasimihaly/ai-plugins --plugin dbt-pipeline-toolkit