From datafusion-skills
Reads and explores Parquet, CSV, JSON, Arrow IPC, and Avro files, locally or from S3/GCS, using datafusion-cli for schema inspection, row counts, and data previews.
Slash command: /datafusion-skills:read-file
You are helping the user read and analyze a data file using Apache DataFusion.
Filename given: $0
Question: ${1:-describe the data}
Follow these steps in order, stopping and reporting clearly if any step fails.
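Determine whether the input is local or remote:
- S3 URI (s3://...) → remote
- GCS URI (gs://...) → remote
- Anything else → local. Search for the file:
find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null
For a local file, use the path that find returns as RESOLVED_PATH. For a remote input, use the URI/URL as-is for RESOLVED_PATH.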
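The local/remote decision above can be sketched as a small shell function. The name `classify_input` and the bucket path are hypothetical, and the sketch only recognizes the two URI schemes listed above.

```shell
# Hypothetical helper: classify an input as remote (S3/GCS URI) or local.
classify_input() {
  case "$1" in
    s3://*|gs://*) echo "remote" ;;
    *)             echo "local" ;;
  esac
}

classify_input "s3://bucket/events.parquet"   # prints: remote
classify_input "events.parquet"               # prints: local
```

Only inputs classified as local need the find lookup; remote inputs are passed through unchanged.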
For S3 access, DataFusion uses environment variables:
- AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
- AWS_PROFILE for profile-based credentials
Check if credentials are available:
test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"
If not available, inform the user they need to configure AWS credentials.
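A minimal sketch of wrapping that check so the user gets a clear message when nothing is configured. The function name `has_aws_credentials` and the message wording are assumptions, not part of the skill.

```shell
# Hypothetical helper: succeeds if any credential source checked above is present.
has_aws_credentials() {
  test -n "${AWS_ACCESS_KEY_ID:-}" \
    || test -n "${AWS_PROFILE:-}" \
    || test -f "${HOME:-}/.aws/credentials"
}

if has_aws_credentials; then
  echo "AWS credentials found"
else
  echo "No AWS credentials found: set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or AWS_PROFILE" >&2
fi
```

This only confirms that a credential source exists; it does not verify that the credentials are valid for the bucket.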
Check that datafusion-cli is installed:
command -v datafusion-cli
If not found, delegate to /datafusion-skills:install-datafusion and then continue.
Detect format from extension:
| Extension | Format | DataFusion support |
|---|---|---|
| .parquet, .pq | Parquet | Direct query: SELECT * FROM 'file.parquet' |
| .csv, .tsv, .txt | CSV | Direct query: SELECT * FROM 'file.csv' |
| .json, .jsonl, .ndjson | JSON | Direct query: SELECT * FROM 'file.json' |
| .arrow, .ipc, .feather | Arrow IPC | CREATE EXTERNAL TABLE with STORED AS ARROW |
| .avro | Avro | CREATE EXTERNAL TABLE with STORED AS AVRO |
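The extension table can be expressed as a shell `case` statement. This is a sketch: the function name `detect_format` and the lowercase return values are hypothetical, and matching extensions case-insensitively is an assumption the table does not state.

```shell
# Hypothetical helper: map a file extension to a format from the table above.
detect_format() {
  # Lowercase the name so .CSV and .csv are treated alike (an assumption).
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.parquet|*.pq)          echo "parquet" ;;
    *.csv|*.tsv|*.txt)       echo "csv" ;;
    *.json|*.jsonl|*.ndjson) echo "json" ;;
    *.arrow|*.ipc|*.feather) echo "arrow" ;;
    *.avro)                  echo "avro" ;;
    *)                       echo "unknown" ;;
  esac
}

detect_format "events.ndjson"   # prints: json
```

Parquet, CSV, and JSON results can go straight to the direct-query path; arrow and avro need CREATE EXTERNAL TABLE.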
Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple
-c flags for multiple statements, or write a .sql file and use --file.
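As an illustration of the two options, the sketch below writes two placeholder statements to a temporary script and runs them both ways. The statements and the temporary file name are made up for the example, and the calls are guarded so nothing runs when datafusion-cli is missing.

```shell
# Write several statements to a temporary .sql file, one per line.
sql_file="${TMPDIR:-/tmp}/_df_demo_$$.sql"
printf '%s\n' \
  "SELECT 1 AS a;" \
  "SELECT 2 AS b;" > "$sql_file"

# Run the statements only if datafusion-cli is installed.
if command -v datafusion-cli >/dev/null 2>&1; then
  # Option 1: one script file holding all statements.
  datafusion-cli --file "$sql_file"
  # Option 2: one -c flag per statement.
  datafusion-cli -c "SELECT 1 AS a;" -c "SELECT 2 AS b;"
fi
```

The script file is usually easier to read once there are more than two or three statements.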
DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:
datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"
datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"
datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"
For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE
using a .sql file:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Arrow IPC files:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
For Avro files:
cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
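The three preview scripts above differ only in the STORED AS keyword and the path, so they could be generated by one helper. This is a sketch: the function name `preview_sql` and the example path are hypothetical, and it does not add CSV OPTIONS or escape single quotes in the path.

```shell
# Hypothetical helper: print the preview script for a format and path.
# FORMAT is one of the STORED AS keywords used above: CSV, ARROW, AVRO.
preview_sql() {
  format=$1
  path=$2
  printf "CREATE EXTERNAL TABLE _preview STORED AS %s LOCATION '%s';\n" "$format" "$path"
  printf 'DESCRIBE _preview;\n'
  printf 'SELECT COUNT(*) AS row_count FROM _preview;\n'
  printf 'SELECT * FROM _preview LIMIT 10;\n'
}

# Generate the script for an Avro file (illustrative path).
preview_sql AVRO 'data/events.avro' > /tmp/_df_preview.sql
# Then run it with: datafusion-cli --file /tmp/_df_preview.sql
```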
If the extension doesn't match any known format:
Common errors:
- datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
- CSV parsing errors → retry with OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
- Unclear errors → use /datafusion-skills:datafusion-docs <error keywords> for help
Using the schema, row count, and sample rows gathered above, answer:
${1:-describe the data: summarize column types, row count, and any notable patterns.}
Be concise but thorough — mention:
After answering, suggest relevant follow-ups:
To query this data further — filter, aggregate, join — use
/datafusion-skills:query.
If the file is useful for repeated access:
To register this as a persistent table, run
/datafusion-skills:create-table RESOLVED_PATH.
If the data is large and the user might want to materialize a summary:
To persist a summary as a Parquet file, try
/datafusion-skills:materialized-view.
Keep suggestions brief and show them only once.
- /datafusion-skills:query for further exploration
- /datafusion-skills:create-table for persistent access
- /datafusion-skills:datafusion-docs for unclear errors
npx claudepluginhub datafusion-contrib/datafusion-skills --plugin datafusion-skills