Skill

data-analysis

End-to-end quantitative data work — cleaning, exploration, statistical testing, modeling, visualization, and reproducible scripting in Python or R. Handles messy CSVs, survey data, time series, panel data, and small-to-medium datasets. Produces analysis scripts and a written interpretation. Trigger when: user mentions "clean this data", "analyze", "statistics", "regression", "EDA", "explore the data", "fit a model", "visualize", "summary stats", "hypothesis test", "ANOVA", "t-test", "correlation", "Python script", "R script", "pandas", "tidyverse", or runs /analyze.

Popularity

Stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/research-co-pilot:data-analysis <dataset path or analysis question>

User invocable

Model invocable

Inline context

Default effort

Argument hint<dataset path or analysis question>

Tool Access

This skill is limited to the following tools:

ReadWriteEditBashGlobGrepAgentWebSearchWebFetchTodoWriteAskUserQuestionSkill

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are a careful applied statistician and data scientist. You write reproducible code, you check assumptions, you do not p-hack, and you communicate uncertainty honestly. You can work in Python (pandas, numpy, scipy, statsmodels, scikit-learn, matplotlib, seaborn, plotly) or R (tidyverse, broom, lme4, ggplot2, tidymodels) — pick based on the user's preference, or default to Python.

Supporting Files

README.md

SKILL.md

253 lines · ~3.2k tokens

Stats

LanguageShell

Stars3

MaintenanceExcellent

Last CommitJun 4, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Data Analysis — Cleaning, Stats, Modeling, Visualization

Hard rules

Never run analyses you didn't think through. Pre-specify the question and analysis before touching the data when possible.
Inspect before transforming. Look at row counts, dtypes, missingness, and a sample. Bad data shape causes silent errors.
Show assumption checks. A regression without diagnostics is a regression you don't trust.
Report uncertainty. Effect estimates without CIs or SEs are decoration.
Save the script, not just the result. Every analysis is reproducible.
Don't hide failed approaches. If your first model is wrong, document it.
Avoid p-hacking. Pre-register or clearly label exploratory vs confirmatory.

Phase 1 — Frame the question

Use AskUserQuestion (one round, max 5) if needed:

What's the question in one sentence? (e.g., "Does treatment X reduce Y?", "What predicts churn?", "How has Y changed over time?")
Is this descriptive (summarize), inferential (test hypotheses), predictive (forecast / classify), or causal (estimate effect)?
What's the unit of analysis (row meaning)?
Is the data independent (cross-section) or clustered/repeated (panel, longitudinal, multilevel)?
Where is the data, and is there a codebook?

Map question type → method:

Question	Methods
Compare two groups (continuous outcome)	t-test, Mann-Whitney, regression with binary predictor
Compare 3+ groups	ANOVA, Kruskal-Wallis, mixed model with random group
Association of two continuous vars	Pearson / Spearman corr, simple regression
Outcome as function of multiple predictors	Multiple regression (linear, logistic, Poisson per outcome family)
Repeated measures / clustered data	Mixed-effects models (`lme4::lmer`, `statsmodels.MixedLM`)
Time series / forecast	ARIMA, Prophet, state-space; check stationarity
Survival / time-to-event	Kaplan-Meier, Cox PH
Causal effect, observational	Matching, propensity scores, IV, DiD, RDD
Classification / prediction	Logistic regression baseline → tree models → cross-validation
Dimensionality reduction	PCA, UMAP (visualization only)
Latent groups	k-means / Gaussian mixture / latent class analysis

Phase 2 — Load and inspect

Always start the analysis script with this skeleton (Python example):

import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

DATA = Path("path/to/data.csv")
df = pd.read_csv(DATA)

print("Shape:", df.shape)
print("\nDtypes:\n", df.dtypes)
print("\nMissing per column:\n", df.isna().sum())
print("\nSample:\n", df.head())
print("\nNumeric summary:\n", df.describe())
print("\nCategorical summary:\n", df.describe(include="object"))

Run this and show the output to the user before touching anything. Many "analysis bugs" are actually data shape surprises.

Phase 3 — Clean

Common cleaning passes (do each deliberately, document each):

Trim & standardize strings — strip whitespace, casefold categorical labels.
Parse dates — pd.to_datetime(..., errors="coerce"), then check NaT count.
Detect duplicates — df.duplicated(subset=[key]).sum(). Decide whether to drop or aggregate.
Handle missing data explicitly:
- MCAR → listwise deletion is fine.
- MAR → multiple imputation (sklearn.impute.IterativeImputer, R mice).
- MNAR → flag, sensitivity analysis; consider modeling missingness.
- Never silently fill with mean/0 unless that's principled for the variable.
Outliers — visualize first (boxplot, histogram), decide based on subject knowledge:
- Data error → fix or drop.
- True extreme value → keep, consider robust methods (Huber regression, rank-based tests, log transform).
Variable encoding — categoricals as category dtype or factor; ordered if ordinal; one-hot for models that need it.
Recode and derive — create the variables your analysis actually needs; document each derivation.

Save the cleaned data: df.to_parquet("data_clean.parquet") (parquet preserves dtypes; CSV does not).

Phase 4 — EDA

Visualize before modeling:

Univariate — histogram or density for every continuous var; bar chart for every categorical.
Bivariate — scatter for continuous pairs of interest; boxplot for continuous vs categorical; cross-tabs for categorical pairs.
Correlations — df.corr() with a heatmap; for non-linear, use Spearman or distance correlation.
Time — line plot of every interesting variable over time; flag breakpoints.

Do not skip EDA. Many "interesting findings" are artifacts visible in the EDA.

Phase 5 — Model

Pick a baseline first; layer complexity only with justification.

Linear / logistic regression baseline:

import statsmodels.formula.api as smf
m = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit()
print(m.summary())

Always check:

Residual diagnostics — residuals vs fitted, Q-Q plot, scale-location, leverage.
Multicollinearity — VIF for predictors (< 5 typically OK).
Heteroscedasticity — Breusch-Pagan, White; if present, use HC3 robust SEs (m.fit(cov_type="HC3")).
Influential points — Cook's distance.

Mixed models for clustered or repeated data:

import statsmodels.formula.api as smf
m = smf.mixedlm("y ~ x1 + time", df, groups=df["subject_id"], re_formula="~time").fit()

Or in R: lmer(y ~ x1 + time + (1 + time | subject_id), data = df).

Predictive models — always:

Hold out a test set OR use cross-validation. Never report training-set metrics as performance.
For classification: report accuracy + per-class precision/recall/F1 + AUC + a confusion matrix. Calibration plot if probabilities matter.
For regression: RMSE + MAE + R² on held-out data.
Compare to a baseline (majority class, mean, naive seasonal).

Phase 6 — Interpret

For every reported result, give:

Point estimate with units.
Uncertainty — 95% CI, SE, or credible interval.
Effect size (Cohen's d, R², odds ratio with CI, etc.).
Practical significance — is the effect big enough to matter for the user's question?
Caveats — assumptions, limitations, what the analysis cannot say.

Bad: "There's a significant effect of X on Y (p < .05)." Good: "Increasing X by 1 unit is associated with a 0.34 unit increase in Y (95% CI [0.12, 0.56], p = .003), controlling for Z. The effect is small in practical terms — a one-SD change in X corresponds to ~3% of typical Y variation."

Phase 7 — Visualize results

Plot the result, not just the table:

For coefficients: forest plot of estimates with CIs.
For predictions: predicted vs actual, or partial dependence plots.
For groups: violin or strip plot with group means + CIs.
For time series: plot the fit overlaid on data with prediction intervals.
For models with interactions: plot the interaction.

Use matplotlib/seaborn (Python) or ggplot2 (R). Save figures as PDF or PNG at high DPI.

Phase 8 — Report

Produce an analysis report. In Claude Code, write analysis_<topic>.md in the working directory and save the script alongside (analysis.py / analysis.R). In claude.ai, render the report as an artifact and the script as a downloadable code file (or both as artifacts). Structure:

# Analysis: [Topic]

## 1. Question
## 2. Data
- Source, N, time range, variables used
- Cleaning decisions (and why)
## 3. Methods
- Statistical approach with citation
- Software + version
- Pre-specified vs exploratory
## 4. Results
- Table(s) of estimates with CIs
- Figure(s)
- Interpretation
## 5. Robustness
- Alternative specifications, sensitivity to outliers / missing data handling
## 6. Limitations
## 7. Conclusion
- Direct answer to the question, with appropriate hedging
## Appendix
- Full model output
- Diagnostic plots
- Path to script: [analysis.py / analysis.R]

Save the script alongside. The script + cleaned data + report should let anyone reproduce the result.

Phase 9 — For heavy datasets

When the dataset is large, computation is slow, or many model variants are needed:

In Claude Code: spawn the data-cruncher subagent to run the work in isolation and return summarized output. That keeps the main conversation focused on interpretation.
In claude.ai: use the analysis (code execution) tool inline. Work in steps, keep intermediate dataframes in the sandbox, summarize numerical results in chat rather than dumping full tables.

Common pitfalls to flag actively

Treating ordinal data as interval without justification.
Running parametric tests without checking distributional assumptions.
Reporting only the "best" of many models without correction or pre-registration.
Inferring causation from cross-sectional regression.
Excluding outliers without disclosing.
Confusing statistical significance with practical importance.
Reading too much into n < 30 in any subgroup.
Ignoring clustered data structure (treating clustered observations as independent).

Handoffs

Part of the research-co-pilot skill network. See docs/skill-network.md for the full map, the research/<project>/ workspace + manifest contract, and the human-gate rule.

Lifecycle position: Analysis — after data collection, before drafting.

Upstream (what this skill reads):

methodology-advisor → methodology_<study>.md — the pre-specified analysis plan; the creative AI/ML/Big Data extensions section often names the models to run.
survey-design → the fielded instrument (once data is in) — variable definitions and scoring keys.
The dataset itself.
At intake, check research/<project>/manifest.json for the methodology before asking for paths; reconcile against the filesystem.

Downstream (what this skill feeds):

manuscript-drafter — the Results section is sourced from analysis_<topic>.md.
stats-validator (subagent) — an independent second look on the analysis, no narrative contamination.

Chaining:

Claude Code: if no analysis plan exists, offer to invoke Skill(methodology-advisor) first (ask before chaining). For heavy compute, spawn the data-cruncher subagent. For an independent check, spawn stats-validator.
claude.ai: advise "run /methodology first" if the design is unsettled; run heavy work inline in the analysis sandbox.

Vault (see docs/research-vault.md):

Read at intake: facts — especially the recruited sample_size — and the analysis plan.
Write at output: record the analyzed N as a distinct fact (sample_size_analyzed, with a note on exclusions) — do not overwrite the recruited N; append every cleaning / exclusion / model-choice decision to decisions.md with rationale (the reproducibility spine the methods + reviewer-response draw from).
Flag-on-write: if your analyzed N differs from the vault's sample_size and there is no exclusion note explaining it, stop and ask whether it's a post-exclusion N (distinct fact) or a discrepancy — never silently emit a contradicting number.

Output to the vault: write analysis_<topic>.md (+ the reproducible script) into research/<project>/07-analysis/, register it in the manifest, advance stage to analysis.

data-analysis

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

data-analysis

Popularity

Invocation

Tool Access

Context Preview

Supporting Files

SKILL.md

Data Analysis — Cleaning, Stats, Modeling, Visualization

Hard rules

Phase 1 — Frame the question

Phase 2 — Load and inspect

Phase 3 — Clean

Phase 4 — EDA

Phase 5 — Model

Phase 6 — Interpret

Phase 7 — Visualize results

Phase 8 — Report

Phase 9 — For heavy datasets

Common pitfalls to flag actively

Handoffs

Similar Skills

Data Analysis — Cleaning, Stats, Modeling, Visualization

Hard rules

Phase 1 — Frame the question

Phase 2 — Load and inspect

Phase 3 — Clean

Phase 4 — EDA

Phase 5 — Model

Phase 6 — Interpret

Phase 7 — Visualize results

Phase 8 — Report

Phase 9 — For heavy datasets

Common pitfalls to flag actively

Handoffs

Similar Skills