From qsv-data-wrangling
Infers semantic ontology from working directory files: entities, attributes, relationships, taxonomy, join paths. Profiles tabular data (CSV/Excel/Parquet) via qsv tools. Outputs ONTOLOGY.md.
How this skill is triggered — by the user, by Claude, or both
Slash command
/qsv-data-wrangling:infer-ontologyThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Scan all files in the current working directory, profile each one, then synthesize a semantic ontology describing the entities, their attributes, the relationships between files, and the domain taxonomy.
Scan all files in the current working directory, profile each one, then synthesize a semantic ontology describing the entities, their attributes, the relationships between files, and the domain taxonomy.
Cowork note: If relative paths don't resolve, call
qsv_get_working_dirandqsv_set_working_dirto sync the working directory.
Sync working directory: Call qsv_get_working_dir to confirm the current path. If needed, call qsv_set_working_dir.
List files: Call qsv_list_files to get all files in the working directory. Classify each file:
Tabular (handled natively by qsv MCP Server — auto-converted to CSV if needed):
.csv, .tsv, .ssv, .tab and .sz compressed variants).xlsx, .xls, .xlsm, .xlsb, .ods).jsonl, .ndjson).parquet, .pq)Non-tabular (best-effort extraction):
.md), Text (.txt), README files — read for domain contextRun data-profile on each tabular file: For every tabular file discovered in step 2, execute the full /data-profile workflow (steps 1-13). This produces for each file:
.stats.csv cache)Run profiles sequentially to avoid overwhelming the MCP server. After each profile completes, retain the key outputs (stats, frequencies, Data Dictionary, Tags) for cross-file analysis.
Extract context from non-tabular files: For each non-tabular file:
Identify shared columns: Compare column names across all profiled files. Flag columns that appear in multiple files (exact name match or close variants like customer_id / cust_id / customerid).
Validate join candidates: For each pair of files sharing column names, read .stats.csv and check:
type from stats — both must be the same type; if mismatched, the relationship is invalid unless one side needs castingcardinality: a foreign key column typically has cardinality ≤ the primary key's cardinalityuniqueness_ratio — a value of 1.0 identifies the primary key side of the relationshipnullcount / sparsity — join columns with sparsity > 0.3 are unreliable join candidates (nulls don't match)min/max) — overlapping ranges suggest a real relationshipskewness — highly skewed join columns (|skewness| > 2) may indicate data quality issues masking relationshipsqsv_sqlp to test overlap when needed:
SELECT COUNT(DISTINCT a.col) as overlap
FROM read_csv('file1.csv') a
INNER JOIN read_csv('file2.csv') b ON a.col = b.col
Detect relationship types: Classify each validated relationship:
Define entities: Each tabular file represents one entity (or multiple if it's a denormalized/wide table). For each entity:
customers.csv → Customer)Define relationships: From the cross-file analysis (steps 5-7), document each relationship:
Infer domain taxonomy: Using all collected information (column names, value distributions, Tags from each file, non-tabular file context):
Assess data quality across the collection: Summarize cross-cutting quality concerns:
# Ontology: {Domain Name}
> Auto-generated ontology inferred from {N} files in `{working_directory}`.
> Generated: {date}
## Overview
{3-5 sentence summary of the dataset collection: what domain it covers, how many
entities, total rows across all files, key relationships, and overall data quality.}
## Domain Taxonomy
**Domain**: {Primary domain}
**Subdomains**: {Comma-separated list}
**Temporal scope**: {Earliest date} to {Latest date} (if applicable)
**Tags**: {Merged and deduplicated tags from all file profiles}
## Entities
### {Entity Name} (`{filename}`)
{Dataset Description from data-profile}
| Field | Type | Label | Description | Nullable | Cardinality | Key |
|-------|------|-------|-------------|----------|-------------|-----|
| ... | ... | ... | ... | ... | ... | PK/FK/— |
**Quality notes**: {Any quality flags from profiling — PII, high nulls, injection, etc.}
{Repeat for each entity}
## Relationships
| Source | Target | Source Column | Target Column | Type | Overlap | Join Expression |
|--------|--------|---------------|---------------|------|---------|-----------------|
| ... | ... | ... | ... | 1:N | 98.5% | `joinp --left file1.csv file2.csv --columns col` |
### Relationship Diagram
{ASCII or text-based entity-relationship diagram showing entities as boxes
and relationships as labeled arrows. Example:}
```
[Customer] 1──→N [Order] N←──1 [Product]
│ │
└──────── N:M ─────────────────┘
(via OrderItem)
```
## Controlled Vocabularies
{For columns with low cardinality that serve as classification dimensions}
### {Column Name} (`{filename}`)
| Value | Frequency | Description |
|-------|-----------|-------------|
| ... | ... | ... |
## Cross-Collection Quality Summary
| Dimension | Status | Details |
|-----------|--------|---------|
| Naming consistency | {OK/Warning} | {Details} |
| Referential integrity | {OK/Warning} | {Orphaned keys found in ...} |
| PII/PHI exposure | {OK/Warning} | {Columns flagged in ...} |
| Type consistency | {OK/Warning} | {Mismatched types for ...} |
| Completeness | {OK/Warning} | {High-null columns in ...} |
## Non-Tabular Context
{Summary of information extracted from non-tabular files that informed the ontology}
| File | Type | Contribution |
|------|------|-------------|
| ... | ... | {What domain context or definitions it provided} |
sqlp with read_parquet() rather than converted to CSVnpx claudepluginhub dathere/qsv --plugin qsv-data-wranglingBuilds business entity graph (ontology) from annotated DBML schemas and taxonomy.json. Confirms natural key strategies for cross-source merging before Kimball CDM design.
Profiles tables or files (CSV, Excel, Parquet, JSON) to reveal shape, null rates, column distributions, top values, percentiles, data quality issues, and column categories.
Profiles CSV/TSV/Excel files: detects format, counts rows/headers, computes basic/advanced statistics (kurtosis, Gini, outliers), shows top value distributions.