Skill

read-file

Reads and explores Parquet, CSV, JSON, Arrow IPC, and Avro files, stored locally or on S3/GCS, using datafusion-cli for schema inspection, row counts, and data previews.

Bash

AWS

GCP

data-engineering

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/datafusion-skills:read-file

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Bash

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are helping the user read and analyze a data file using Apache DataFusion.

SKILL.md

176 lines · ~1.4k tokens

Stats

Stars: 12

Maintenance: Good

Last Commit: Mar 21, 2026

Step 1 — Classify and resolve the path

Determine whether the input is local or remote:

S3 URI (s3://...) → remote
GCS URI (gs://...) → remote
HTTPS/HTTP URL → remote (DataFusion supports HTTP via object_store)
Otherwise → local file
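
The classification rules above can be sketched as a small shell function. The function name `classify_path` and the bucket name in the usage lines are illustrative, not part of the skill:

```shell
# Classify an input as "remote" or "local" from its URI scheme.
classify_path() {
  case "$1" in
    s3://*|gs://*|http://*|https://*) echo "remote" ;;
    *) echo "local" ;;
  esac
}

classify_path "s3://my-bucket/events.parquet"   # prints "remote" (hypothetical bucket)
classify_path "data/events.parquet"             # prints "local"
```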

Local files

find "$PWD" -name "$0" -not -path '*/.git/*' 2>/dev/null

Zero results → tell the user the file was not found and stop.
More than one result → list all matches, ask the user to re-run with a fuller path, and stop.
Exactly one result → use that full path (RESOLVED_PATH).
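
A minimal sketch of the three outcomes above, wrapping the same `find` call. The function name `resolve_local`, its optional search-root argument, and the return codes 1 and 2 are assumptions made for illustration; file names containing newlines are not handled:

```shell
# Resolve a bare file name to exactly one path under a search root.
# Prints the path and returns 0 on a unique match,
# returns 1 when nothing matches and 2 when the name is ambiguous.
resolve_local() {
  rl_name=$1
  rl_root=${2:-$PWD}
  rl_matches=$(find "$rl_root" -name "$rl_name" -not -path '*/.git/*' 2>/dev/null)
  if [ -z "$rl_matches" ]; then
    echo "File not found: $rl_name" >&2
    return 1
  fi
  rl_count=$(printf '%s\n' "$rl_matches" | wc -l | tr -d ' ')
  if [ "$rl_count" -gt 1 ]; then
    echo "Multiple matches, re-run with a fuller path:" >&2
    printf '%s\n' "$rl_matches" >&2
    return 2
  fi
  printf '%s\n' "$rl_matches"
}

# Usage (stops on zero or multiple matches):
#   RESOLVED_PATH=$(resolve_local "events.parquet") || exit 1
```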

Remote files

Use the URI/URL as-is for RESOLVED_PATH.

For S3 access, DataFusion uses environment variables:

AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
Or AWS_PROFILE for profile-based credentials

Check if credentials are available:

test -n "$AWS_ACCESS_KEY_ID" || test -n "$AWS_PROFILE" || test -f "$HOME/.aws/credentials"

If not available, inform the user they need to configure AWS credentials.
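
One way to combine the check with the user-facing message is sketched below. The function name `check_s3_credentials` and the wording of the message are illustrative; the check only looks for the presence of credentials and does not verify that they are valid:

```shell
# Return 0 if the input is not an S3 URI or if some AWS credential source exists.
# Return 1 and print a hint when an S3 URI is given without any credentials.
check_s3_credentials() {
  case "$1" in
    s3://*) ;;          # only S3 inputs need the check
    *) return 0 ;;
  esac
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] || [ -n "${AWS_PROFILE:-}" ] \
     || [ -f "$HOME/.aws/credentials" ]; then
    return 0
  fi
  echo "No AWS credentials found. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, set AWS_PROFILE, or create ~/.aws/credentials." >&2
  return 1
}
```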

Step 2 — Check datafusion-cli is installed

command -v datafusion-cli

If not found, delegate to /datafusion-skills:install-datafusion and then continue.

Step 3 — Detect file format and read

Detect format from extension:

| Extension | Format | DataFusion support |
|---|---|---|
| `.parquet`, `.pq` | Parquet | Direct query: `SELECT * FROM 'file.parquet'` |
| `.csv`, `.tsv`, `.txt` | CSV | Direct query: `SELECT * FROM 'file.csv'` |
| `.json`, `.jsonl`, `.ndjson` | JSON | Direct query: `SELECT * FROM 'file.json'` |
| `.arrow`, `.ipc`, `.feather` | Arrow IPC | `CREATE EXTERNAL TABLE` with `STORED AS ARROW` |
| `.avro` | Avro | `CREATE EXTERNAL TABLE` with `STORED AS AVRO` |
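
The extension mapping above could be expressed as a shell `case` statement. The function name `detect_format` is illustrative, and lower-casing the name first (so that `.PARQUET` also matches) is an added assumption, not something the skill specifies:

```shell
# Map a file name or URI to a format keyword based on its extension.
detect_format() {
  # Lower-case the name so ".PARQUET" and ".parquet" are treated alike (assumption).
  df_name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$df_name" in
    *.parquet|*.pq)          echo "parquet" ;;
    *.csv|*.tsv|*.txt)       echo "csv" ;;
    *.json|*.jsonl|*.ndjson) echo "json" ;;
    *.arrow|*.ipc|*.feather) echo "arrow" ;;
    *.avro)                  echo "avro" ;;
    *)                       echo "unknown" ;;
  esac
}

detect_format "s3://my-bucket/events.pq"   # prints "parquet" (hypothetical bucket)
detect_format "notes.dat"                  # prints "unknown"
```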

Important: datafusion-cli -c only accepts one SQL statement per flag. Use multiple -c flags for multiple statements, or write a .sql file and use --file.

For Parquet, CSV, and JSON files (direct query):

DataFusion v44+ supports direct queries on Parquet, CSV, and JSON files by path:

datafusion-cli -c "DESCRIBE 'RESOLVED_PATH';"

datafusion-cli -c "SELECT COUNT(*) AS row_count FROM 'RESOLVED_PATH';"

datafusion-cli -c "SELECT * FROM 'RESOLVED_PATH' LIMIT 10;"
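
The three commands above can also be sent in a single invocation, with one `-c` flag per statement as described earlier. The sketch below assumes the path contains no single quotes; the function name `preview_direct` and the `DRY_RUN` variable (which prints the arguments instead of running them) are hypothetical helpers:

```shell
# Run schema, row count, and preview in one datafusion-cli invocation.
# $1 = resolved path (RESOLVED_PATH from Step 1).
preview_direct() {
  pd_path=$1
  # One SQL statement per -c flag.
  set -- \
    -c "DESCRIBE '$pd_path';" \
    -c "SELECT COUNT(*) AS row_count FROM '$pd_path';" \
    -c "SELECT * FROM '$pd_path' LIMIT 10;"
  if [ -n "${DRY_RUN:-}" ]; then
    # Show each argument in brackets without calling the CLI.
    printf 'datafusion-cli'
    printf ' [%s]' "$@"
    printf '\n'
  else
    datafusion-cli "$@"
  fi
}
```

Setting `DRY_RUN=1` before calling `preview_direct` is a convenient way to inspect the quoting before touching a large remote file.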

For CSV files with non-standard delimiters or no header, fall back to CREATE EXTERNAL TABLE using a .sql file:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS CSV LOCATION 'RESOLVED_PATH' OPTIONS ('has_header' 'false', 'delimiter' '\t');
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Arrow IPC files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS ARROW LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql

For Avro files:

cat > /tmp/_df_preview.sql << 'SQL'
CREATE EXTERNAL TABLE _preview STORED AS AVRO LOCATION 'RESOLVED_PATH';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
datafusion-cli --file /tmp/_df_preview.sql
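
The Arrow IPC and Avro scripts above differ only in the `STORED AS` keyword, so they can be generated from one template. This is a sketch under two assumptions: the path contains no single quotes, and a `mktemp` file is an acceptable replacement for the fixed `/tmp/_df_preview.sql`. The function names `preview_sql` and `run_preview` are illustrative:

```shell
# Print the preview script for a format that needs CREATE EXTERNAL TABLE.
# $1 = STORED AS keyword (for example ARROW or AVRO), $2 = resolved path.
preview_sql() {
  cat <<SQL
CREATE EXTERNAL TABLE _preview STORED AS $1 LOCATION '$2';
DESCRIBE _preview;
SELECT COUNT(*) AS row_count FROM _preview;
SELECT * FROM _preview LIMIT 10;
SQL
}

# Write the script to a temporary file, run it, and clean up afterwards.
run_preview() {
  rp_file=$(mktemp) || return 1
  preview_sql "$1" "$2" > "$rp_file"
  datafusion-cli --file "$rp_file"
  rp_status=$?
  rm -f "$rp_file"
  return "$rp_status"
}

# Usage: run_preview ARROW "$RESOLVED_PATH"
```

Using a unique temporary file avoids two concurrent runs overwriting each other's script.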

Unknown format

If the extension doesn't match any known format:

Try Parquet first (most common in data engineering)
If that fails, try CSV with auto-detection
If both fail, report the error and ask the user to specify the format
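
The fallback order could be sketched as a probe loop. This relies on an assumption worth verifying for your version: that `datafusion-cli -c` exits with a non-zero status when a statement fails. The names `DF_CLI`, `try_format`, and `probe_format` are hypothetical, and the path is assumed to contain no single quotes:

```shell
# Command used to run SQL; overridable so the logic can be exercised without the real CLI.
DF_CLI=${DF_CLI:-datafusion-cli}

# Try to read one row of the file as the given format.
# $1 = STORED AS keyword, $2 = resolved path.
# ASSUMPTION: the CLI returns a non-zero exit status when a statement fails.
try_format() {
  "$DF_CLI" \
    -c "CREATE EXTERNAL TABLE _probe STORED AS $1 LOCATION '$2';" \
    -c "SELECT * FROM _probe LIMIT 1;" >/dev/null 2>&1
}

# Print the first format that reads successfully: Parquet first, then CSV.
probe_format() {
  for pf_fmt in PARQUET CSV; do
    if try_format "$pf_fmt" "$1"; then
      echo "$pf_fmt"
      return 0
    fi
  done
  echo "Could not read $1 as Parquet or CSV; please specify the format." >&2
  return 1
}
```

Because almost any text file parses as CSV, a successful CSV probe is weak evidence; the preview rows are still worth checking by eye.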

Step 4 — Handle errors

datafusion-cli: command not found → invoke /datafusion-skills:install-datafusion and retry
File not found → double-check the path and suggest using an absolute path
Parse error on CSV → try different options: OPTIONS ('has_header' 'false'), or OPTIONS ('delimiter' '\t') for TSV
S3 access denied → remind the user to configure AWS credentials
Persistent error → use /datafusion-skills:datafusion-docs <error keywords> for help

Step 5 — Answer the question

Using the schema, row count, and sample rows gathered above, answer:

${1:-describe the data: summarize column types, row count, and any notable patterns.}

Be concise but thorough — mention:

Number of columns and their types
Row count
Any notable patterns in the sample (nulls, date ranges, value distributions)

Step 6 — Suggest next steps

After answering, suggest relevant follow-ups:

To query this data further — filter, aggregate, join — use /datafusion-skills:query.

If the file is useful for repeated access:

To register this as a persistent table, run /datafusion-skills:create-table RESOLVED_PATH.

If the data is large and the user might want to materialize a summary:

To persist a summary as a Parquet file, try /datafusion-skills:materialized-view.

Keep suggestions brief and show them only once.

Cross-skill integration

Query follow-ups: Suggest /datafusion-skills:query for further exploration
Table registration: Suggest /datafusion-skills:create-table for persistent access
Error troubleshooting: Use /datafusion-skills:datafusion-docs for unclear errors
