From research-analysis
Use this skill when the user uploads Excel (.xlsx/.xls) or CSV files and wants to perform data analysis, generate statistics, create summaries, pivot tables, SQL queries, or any form of structured data exploration. Supports multi-sheet Excel workbooks, aggregation, filtering, joins, and exporting results to CSV/JSON/Markdown.
How this skill is triggered — by the user, by Claude, or both
Slash command
/research-analysis:data-analysisThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.
This skill analyzes user-uploaded Excel/CSV files using DuckDB — an in-process analytical SQL engine. It supports schema inspection, SQL-based querying, statistical summaries, and result export, all through a single Python script.
This skill uses a sandboxed temporary directory to avoid polluting the user's workspace:
| Base | Meaning |
|---|---|
./scripts/ | Relative to SKILL.md directory (skill root) — contains Python scripts |
<upload-dir> | Directory containing user-uploaded data files (provided by user) |
<sandbox-dir> | Temporary directory for this skill session (auto-created, auto-cleaned) |
<sandbox-dir>/outputs/ | Stores exported results |
<sandbox-dir>/.cache/ | Stores DuckDB cache files |
Important: All temporary files are stored in
<sandbox-dir>. The agent should create this directory at session start and clean it up at session end (unless the user wants to keep the outputs).
At the start of each skill session, the agent should:
# Create sandbox directory with unique session ID
SANDBOX_DIR="/tmp/skill-sessions/data-analysis-$(date +%s)"
mkdir -p "$SANDBOX_DIR/outputs" "$SANDBOX_DIR/.cache"
At the end of the session (or when the user confirms), clean up:
# Remove sandbox directory (keeping outputs if user wants)
rm -rf "$SANDBOX_DIR"
When a user uploads data files and requests analysis, identify:
<upload-dir>/First, inspect the uploaded file to understand its schema:
python ./scripts/analyze.py \
--files <upload-dir>/data.xlsx \
--action inspect
This returns:
Based on the schema, construct SQL queries to answer the user's questions.
python ./scripts/analyze.py \
--files <upload-dir>/data.xlsx \
--action query \
--sql "SELECT category, COUNT(*) as count, AVG(amount) as avg_amount FROM Sheet1 GROUP BY category ORDER BY count DESC"
python ./scripts/analyze.py \
--files <upload-dir>/data.xlsx \
--action summary \
--table Sheet1
This returns for each numeric column: count, mean, std, min, 25%, 50%, 75%, max, null_count. For string columns: count, unique, top value, frequency, null_count.
python ./scripts/analyze.py \
--files <upload-dir>/data.xlsx \
--action query \
--sql "SELECT * FROM Sheet1 WHERE amount > 1000" \
--output-file <sandbox-dir>/outputs/filtered-results.csv
Supported output formats (auto-detected from extension):
.csv — Comma-separated values.json — JSON array of records.md — Markdown table| Parameter | Required | Description |
|---|---|---|
--files | Yes | Space-separated paths to Excel/CSV files |
--action | Yes | One of: inspect, query, summary |
--sql | For query | SQL query to execute |
--table | For summary | Table/sheet name to summarize |
--output-file | No | Path to export results (CSV/JSON/MD) |
[!NOTE] Do NOT read the Python file, just call it with the parameters.
Sheet1, Sales, Revenue)data.csv → data)"2024_Sales"-- Row count
SELECT COUNT(*) FROM Sheet1
-- Distinct values in a column
SELECT DISTINCT category FROM Sheet1
-- Value distribution
SELECT category, COUNT(*) as cnt FROM Sheet1 GROUP BY category ORDER BY cnt DESC
-- Date range
SELECT MIN(date_col), MAX(date_col) FROM Sheet1
-- Revenue by category and month
SELECT category, DATE_TRUNC('month', order_date) as month,
SUM(revenue) as total_revenue
FROM Sales
GROUP BY category, month
ORDER BY month, total_revenue DESC
-- Top 10 customers by spend
SELECT customer_name, SUM(amount) as total_spend
FROM Orders GROUP BY customer_name
ORDER BY total_spend DESC LIMIT 10
-- Join sales with customer info from different files
SELECT s.order_id, s.amount, c.customer_name, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.id
WHERE s.amount > 500
-- Running total and rank
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date) as running_total,
RANK() OVER (ORDER BY amount DESC) as amount_rank
FROM Sales
-- Pivot: monthly revenue by category
SELECT category,
SUM(CASE WHEN MONTH(date) = 1 THEN revenue END) as Jan,
SUM(CASE WHEN MONTH(date) = 2 THEN revenue END) as Feb,
SUM(CASE WHEN MONTH(date) = 3 THEN revenue END) as Mar
FROM Sales
GROUP BY category
User uploads sales_2024.xlsx (with sheets: Orders, Products, Customers) and asks: "Analyze my sales data — show top products by revenue and monthly trends."
python ./scripts/analyze.py \
--files <upload-dir>/sales_2024.xlsx \
--action inspect
python ./scripts/analyze.py \
--files <upload-dir>/sales_2024.xlsx \
--action query \
--sql "SELECT p.product_name, SUM(o.quantity * o.unit_price) as total_revenue, SUM(o.quantity) as total_units FROM Orders o JOIN Products p ON o.product_id = p.id GROUP BY p.product_name ORDER BY total_revenue DESC LIMIT 10"
python ./scripts/analyze.py \
--files <upload-dir>/sales_2024.xlsx \
--action query \
--sql "SELECT DATE_TRUNC('month', order_date) as month, SUM(quantity * unit_price) as revenue FROM Orders GROUP BY month ORDER BY month" \
--output-file <sandbox-dir>/outputs/monthly-trends.csv
python ./scripts/analyze.py \
--files <upload-dir>/sales_2024.xlsx \
--action summary \
--table Orders
Present results to the user with clear explanations of findings, trends, and actionable insights.
User uploads orders.csv and customers.xlsx and asks: "Which region has the highest average order value?"
python ./scripts/analyze.py \
--files <upload-dir>/orders.csv <upload-dir>/customers.xlsx \
--action query \
--sql "SELECT c.region, AVG(o.amount) as avg_order_value, COUNT(*) as order_count FROM orders o JOIN Customers c ON o.customer_id = c.id GROUP BY c.region ORDER BY avg_order_value DESC"
After analysis:
present_files toolThe script automatically caches loaded data to avoid re-parsing files on every call:
<sandbox-dir>/.cache/This is especially useful when running multiple queries against the same data files (inspect → query → summary).
After analysis:
present_files tool<sandbox-dir> when session ends (unless user wants to keep files)DATE_TRUNC, EXTRACT, etc.)"Column Name"Searches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.
npx claudepluginhub shanghai-jerry/skills --plugin research-analysis