Finds and evaluates datasets for a research question using parallel Explorer agents and a Critic that stress-tests candidates for feasibility.
Slash command: /social-science-research:data-finder [research topic or 'from spec']
When to use: find data, what data to use, find dataset, where to get data on X, assess datasets, what datasets exist, help find data, is there data, data options, need data for project
Find and assess datasets for your research question. Two Explorer agents search in parallel across data source categories; an Explorer-Critic then stress-tests each candidate against the research design.
Input: $ARGUMENTS, which is either a research topic or 'from spec' (read the research question from quality_reports/).
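Find the most recent quality_reports/project_spec_*.md or quality_reports/specs/*.md and extract the fields the agent prompts need: research question, empirical strategy, treatment, outcome, controls, time period, geography, and unit of observation.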
Read references/domain-profile.md if it exists — extract the Common Datasets section (domain-specific datasets to check first).
If no research spec exists, extract the variables and strategy from $ARGUMENTS directly. If the request is vague, ask: "What are the treatment and outcome variables, and what empirical strategy did you have in mind?"
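The spec lookup and its fallback can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: it assumes "most recent" means latest modification time, and the helper names `find_latest_spec` and `resolve_source` are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

# Glob patterns taken from the step above; everything else is illustrative.
SPEC_PATTERNS = ("quality_reports/project_spec_*.md", "quality_reports/specs/*.md")


def find_latest_spec(root: Path) -> Optional[Path]:
    """Return the most recently modified spec under root, or None if there is none."""
    candidates = [p for pattern in SPEC_PATTERNS for p in root.glob(pattern)]
    if not candidates:
        return None
    # Assumption: "most recent" is judged by file modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resolve_source(arguments: str, root: Path = Path(".")) -> Tuple[str, str]:
    """Prefer a spec file; otherwise fall back to the raw $ARGUMENTS text."""
    spec = find_latest_spec(root)
    if spec is not None:
        return ("spec", spec.as_posix())
    return ("arguments", arguments.strip())
```

When `resolve_source` returns `"arguments"`, the caller would still need to ask the clarifying question above if the text is too vague to yield treatment, outcome, and strategy.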
Split the source categories between two Explorer agents to parallelize the search.
Explorer A — Institutional Data:
Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)
2. Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)
3. Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
Explorer B — Broader and Alternative Sources:
Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].
Your source categories to search:
1. International data (World Bank, OECD, Eurostat, IMF, IPUMS International)
2. Novel/alternative (satellite, web scraping, proprietary, RCT registries)
3. Any field-specific datasets not covered by Explorer A
For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
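The fan-out to two Explorers follows a familiar pattern: build one prompt per agent from shared fields, run both concurrently, and collect the results before moving on. The sketch below illustrates that pattern only. The prompt template is abridged (the variables and domain-dataset lines are omitted), and `dispatch` and `fake_dispatch` are hypothetical stand-ins for whatever mechanism actually launches an agent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Abridged version of the Explorer task prompt shown above.
SHARED_TEMPLATE = (
    "You are an Explorer agent. Research question: {question}.\n"
    "Empirical strategy: {strategy}.\n"
    "Your source categories to search:\n{categories}\n"
    "For each dataset found, produce the full Explorer report format."
)

# Category split between the two Explorers, as listed in the prompts above.
EXPLORER_CATEGORIES: Dict[str, List[str]] = {
    "A": ["Public microdata", "Administrative data", "Survey panels"],
    "B": ["International data", "Novel/alternative",
          "Any field-specific datasets not covered by Explorer A"],
}


def build_prompt(explorer: str, question: str, strategy: str) -> str:
    """Fill the shared template with one Explorer's numbered category list."""
    numbered = "\n".join(
        f"{i}. {name}" for i, name in enumerate(EXPLORER_CATEGORIES[explorer], start=1)
    )
    return SHARED_TEMPLATE.format(question=question, strategy=strategy, categories=numbered)


def run_explorers(question: str, strategy: str,
                  dispatch: Callable[[str], str]) -> Dict[str, str]:
    """Launch both Explorers concurrently and wait for both results."""
    prompts = {name: build_prompt(name, question, strategy) for name in EXPLORER_CATEGORIES}
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        futures = {name: pool.submit(dispatch, prompt) for name, prompt in prompts.items()}
        return {name: future.result() for name, future in futures.items()}


def fake_dispatch(prompt: str) -> str:
    """Hypothetical stand-in for an agent call: echoes the first category line."""
    return prompt.splitlines()[3]


if __name__ == "__main__":
    print(run_explorers("Does X affect Y?", "difference-in-differences", fake_dispatch))
```

Collecting both results before returning mirrors the requirement that the Critic runs only after both Explorers complete.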
After both Explorer agents complete, dispatch the Explorer-Critic with the full combined dataset list.
Task prompt: "You are an Explorer-Critic agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Here is the combined dataset list from the Explorer agents:
[paste all Explorer findings]
Apply the 5-point critique to each dataset:
1. Measurement validity
2. Sample selection
3. External validity
4. Identification compatibility
5. Known issues
Produce adjusted feasibility grades and deal-breaker flags.
Follow the Explorer-Critic agent instructions."
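Once the Critic returns adjusted grades and deal-breaker flags, the candidates have to be sorted into the report's sections and the report needs a file name. The sketch below shows one way to do both. The bucketing rule and the sanitizing rule are assumptions (the source does not define either), as are the `Candidate` fields and the existence of grades below C.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    grade: str          # adjusted grade from the Explorer-Critic, e.g. "A", "B", "C"
    deal_breaker: bool  # deal-breaker flag from the Explorer-Critic


def sanitize_topic(topic: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics to underscores, trim the ends."""
    return re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")


def report_path(topic: str) -> str:
    """Build the report path using the naming pattern from the compile step."""
    return f"quality_reports/data_exploration_{sanitize_topic(topic)}.md"


def bucket(candidates: List[Candidate]) -> Dict[str, List[str]]:
    """Assumed rule: deal-breakers and grades below C go to the rejection table."""
    sections: Dict[str, List[str]] = {"top": [], "with_effort": [], "rejected": []}
    for c in candidates:
        if c.deal_breaker or c.grade not in ("A", "B", "C"):
            sections["rejected"].append(c.name)
        elif c.grade == "C":
            sections["with_effort"].append(c.name)
        else:
            sections["top"].append(c.name)
    return sections
```

Keeping C-grade datasets in their own bucket rather than rejecting them matches the report layout, which gives them a separate section.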
After the Explorer-Critic completes, compile the final ranked report:
Save to quality_reports/data_exploration_[sanitized_topic].md:
# Data Exploration: [Topic]
**Date:** [YYYY-MM-DD]
**Research question:** [one sentence]
**Empirical strategy:** [method]
**Variables sought:** Treatment = [X], Outcome = [Y], Controls = [list]
## Top Candidates (Grade A–B)
### 1. [Dataset Name] — Grade: A/B
**Provider:** [Name] | **Access:** [Public/Restricted/etc.] | **URL:** [link]
**Coverage:** [time period] | [geography] | [unit of observation] | N ≈ [size]
**Key Variables:**
- Treatment proxy: [variable]
- Outcome: [variable]
- Controls available: [list]
**Explorer-Critic Assessment:**
- Measurement validity: [1-2 sentences]
- Sample selection: [1-2 sentences]
- External validity: [1-2 sentences]
- Identification compatibility: [focused on the proposed strategy]
- Known issues: [specific documented problems]
**Bottom line:** [1-2 sentences — viable and under what conditions]
[Repeat for all A and B grade datasets]
## Accessible With Effort (Grade C)
[Brief summaries — name, access path, main limitation, why C not B]
## Rejection Table
| Dataset | Reason for Rejection | Deal-breaker? |
|---------|---------------------|---------------|
| [Name] | [Explorer-Critic finding] | YES/NO |
## Recommended Path Forward
1. **Best dataset:** [Name] — [one sentence why]
2. **Fallback if [best] unavailable:** [Name] — [why it's second choice]
3. **Access path for [best]:** [download URL, application URL, IRB requirements, restricted-data steps]
### Ingest Recipe for [best]
A copy-pasteable load-and-clean block for the recommended dataset. Tailor to whether the source is a flat file, an API/portal, or a non-tabular source (PDF, scraped HTML).
**R:**
```r
# Download (or login + download; note in a comment if manual)
library(tidyverse)
df <- readr::read_csv("data/raw/[file].csv")  # or haven::read_dta, arrow::read_parquet
df_clean <- df %>%
  rename(...) %>%
  mutate(...) %>%
  filter(...)
arrow::write_parquet(df_clean, "data/processed/[name].parquet")
```
**Python:**
```python
import pandas as pd
df = pd.read_csv("data/raw/[file].csv")  # or pd.read_stata, pd.read_parquet
df_clean = (
    df.rename(columns={...})
    .assign(...)
    .query("...")
)
df_clean.to_parquet("data/processed/[name].parquet")
```
If the source is a PDF, scraped HTML page, or government portal API, replace the load step with the appropriate extraction recipe — tabulizer/pdftools/rvest/tidycensus/fredr in R, or pdfplumber/camelot/pandas.read_html/census/fredapi in Python. The /data-analysis skill's Phase 0.5 documents the full set; cross-reference it if extraction is non-trivial.
srvyr::as_survey_design()]
- /data-analysis [dataset]: begin analysis with the recommended dataset (Phase 0.5 handles non-tabular sources; Phase 3.5 runs design-specific identification diagnostics)
- /lit-review [topic]: check whether papers in the literature use these datasets (helps validate the choice)
## Important
- **Identification compatibility is the most important criterion.** A perfectly accessible dataset that can't support the proposed empirical strategy is useless. The Explorer-Critic's grade on this dimension should drive the recommendation.
- **Access level affects timeline.** An FSRDC dataset may take 1-2 years to access. A public download can start today. Make this tradeoff explicit.
- **Don't reject C-grade datasets outright.** An FSRDC dataset with perfect identification fit may be the right choice for a dissertation. Present the access path clearly.
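One way to make the identification-versus-access tradeoff explicit is to rank on identification fit first and use access time only as a tie-breaker, then print both numbers for every option. The sketch below assumes a 1-5 identification score and an access estimate in months. Both fields, the `Option` type, and the example values in the usage note are hypothetical.

```python
from typing import List, NamedTuple


class Option(NamedTuple):
    name: str
    identification_fit: int  # hypothetical 1-5 score; 5 = fully supports the strategy
    access_months: int       # hypothetical estimate of time until the data are in hand


def rank(options: List[Option]) -> List[Option]:
    """Sort by identification fit (descending), then by access time (ascending)."""
    return sorted(options, key=lambda o: (-o.identification_fit, o.access_months))


def describe(option: Option) -> str:
    """State the tradeoff in one line so the user sees both dimensions."""
    if option.access_months == 0:
        when = "can start today"
    else:
        when = f"about {option.access_months} months to access"
    return f"{option.name}: identification fit {option.identification_fit}/5, {when}"
```

Under this ordering a restricted file with a fit of 5 and an 18-month wait still outranks a public download with a fit of 3, and the `describe` line keeps the wait visible to the user.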
npx claudepluginhub felpix-studios/social-science-research --plugin social-science-research