Skill

data-finder

Finds and evaluates datasets for a research question using parallel Explorer agents and a Critic that stress-tests candidates for feasibility.

data-engineering

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/social-science-research:data-finder [research topic or 'from spec']

User invocable

Model invocable

Inline context

Default effort

When to use

find data, what data to use, find dataset, where to get data on X, assess datasets, what datasets exist, help find data, is there data, data options, need data for project

Argument hint: [research topic or 'from spec']

Tool Access

This skill is limited to the following tools:

Read, Grep, Glob, Write, WebSearch, WebFetch, Task

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Find and assess datasets for your research question. Two Explorer agents search in parallel across data source categories; an Explorer-Critic then stress-tests each candidate against the research design.

SKILL.md

213 lines · ~2k tokens

Stats

Language: Python

Stars: 7

Forks: 1

Maintenance: Excellent

Last Commit: May 26, 2026

# Data Finder

Input: `$ARGUMENTS`, either a topic or `from spec` to read the research question from `quality_reports/`.

## Step 1: Read Research Context

Find the most recent `quality_reports/project_spec_*.md` or `quality_reports/specs/*.md` and extract:
- Research question
- Empirical strategy (DiD, RDD, IV, etc.)
- Treatment variable (what varies)
- Outcome variable (what we measure)
- Controls needed
- Time period of interest
- Geography (national, state, county, individual)
- Unit of observation (individual, household, firm, establishment)

Read `references/domain-profile.md` if it exists and extract the Common Datasets section (domain-specific datasets to check first).

If no research spec exists, extract the variables and strategy from `$ARGUMENTS` directly. If the request is vague, ask: "What are the treatment and outcome variables, and what empirical strategy did you have in mind?"
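
The spec lookup in this step can be sketched in Python as below. This is a minimal illustration, not part of the skill itself: the function name `find_latest_spec` is hypothetical, and it assumes "most recent" means the latest file modification time, which the skill does not state.

```python
from pathlib import Path
from typing import Optional

# Glob patterns taken from Step 1, relative to the project root.
SPEC_PATTERNS = (
    "quality_reports/project_spec_*.md",
    "quality_reports/specs/*.md",
)


def find_latest_spec(project_root: str = ".") -> Optional[Path]:
    """Return the most recently modified spec file, or None if none exist."""
    root = Path(project_root)
    candidates = [
        path
        for pattern in SPEC_PATTERNS
        for path in root.glob(pattern)
        if path.is_file()
    ]
    if not candidates:
        # No spec on disk: the caller falls back to parsing $ARGUMENTS.
        return None
    # Assumption: "most recent" is judged by modification time.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

A `None` result corresponds to the fallback case, where the variables and strategy come from the arguments instead of a spec file.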

## Step 2: Dispatch Two Explorer Agents in Parallel

Split the source categories between two Explorer agents to parallelize the search.

Explorer A — Institutional Data:

Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].

Your source categories to search:
1. Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)
2. Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)
3. Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)

For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."

Explorer B — Broader and Alternative Sources:

Task prompt: "You are an Explorer agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].
Domain datasets (check first): [list from domain-profile if available].

Your source categories to search:
1. International data (World Bank, OECD, Eurostat, IMF, IPUMS International)
2. Novel/alternative (satellite, web scraping, proprietary, RCT registries)
3. Any field-specific datasets not covered by Explorer A

For each dataset found, produce the full Explorer report format.
Follow the Explorer agent instructions."
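
The two prompts above differ only in their source categories, so they can be filled from one shared template. The sketch below shows one way to do that. The names `ResearchContext`, `EXPLORER_CATEGORIES`, and `build_explorer_prompt` are hypothetical, and rendering an empty domain list as "none listed" is an assumption. Dispatching the prompts through the Task tool is outside what this code covers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchContext:
    """Fields extracted in Step 1 (class and field names are hypothetical)."""
    question: str
    strategy: str
    treatment: str
    outcome: str
    controls: List[str]
    period: str
    geography: str
    unit: str
    domain_datasets: List[str] = field(default_factory=list)


# Source categories copied from the two Explorer prompts.
EXPLORER_CATEGORIES = {
    "A": [
        "Public microdata (CPS, ACS, NHIS, MEPS, SIPP, QWI)",
        "Administrative data (Medicare/Medicaid, IRS, SSA, vital statistics, court records)",
        "Survey panels (PSID, HRS, Add Health, NLSY97/79, BHPS/UKHLS)",
    ],
    "B": [
        "International data (World Bank, OECD, Eurostat, IMF, IPUMS International)",
        "Novel/alternative (satellite, web scraping, proprietary, RCT registries)",
        "Any field-specific datasets not covered by Explorer A",
    ],
}


def build_explorer_prompt(ctx: ResearchContext, explorer: str) -> str:
    """Fill the shared prompt template for Explorer 'A' or 'B'."""
    categories = "\n".join(
        f"{i}. {name}"
        for i, name in enumerate(EXPLORER_CATEGORIES[explorer], start=1)
    )
    # Assumption: an empty domain list is rendered as "none listed".
    domain = ", ".join(ctx.domain_datasets) or "none listed"
    return (
        f"You are an Explorer agent. Research question: {ctx.question}.\n"
        f"Empirical strategy: {ctx.strategy}.\n"
        f"Variables needed. Treatment: {ctx.treatment}, Outcome: {ctx.outcome}, "
        f"Controls: {', '.join(ctx.controls)},\n"
        f"Time period: {ctx.period}, Geography: {ctx.geography}, Unit: {ctx.unit}.\n"
        f"Domain datasets (check first): {domain}.\n\n"
        f"Your source categories to search:\n{categories}\n\n"
        "For each dataset found, produce the full Explorer report format.\n"
        "Follow the Explorer agent instructions."
    )
```

Building both prompts from the same context object keeps the research question and variable list identical across the two agents, so their findings stay comparable when the Explorer-Critic receives them.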

## Step 3: Dispatch Explorer-Critic

After both Explorer agents complete, dispatch the Explorer-Critic with the full combined dataset list.

Task prompt: "You are an Explorer-Critic agent. Research question: [question].
Empirical strategy: [strategy].
Variables needed — Treatment: [X], Outcome: [Y], Controls: [list],
Time period: [period], Geography: [geo], Unit: [unit].

Here is the combined dataset list from the Explorer agents:
[paste all Explorer findings]

Apply the 5-point critique to each dataset:
1. Measurement validity
2. Sample selection
3. External validity
4. Identification compatibility
5. Known issues

Produce adjusted feasibility grades and deal-breaker flags.
Follow the Explorer-Critic agent instructions."

## Step 4: Produce Ranked Output

After the Explorer-Critic completes, compile the final ranked report:

Sort datasets by adjusted feasibility grade (A first, then B, then C, then D).
Within each grade, sort by identification compatibility score (highest first).
Separate out deal-breaker datasets into the rejection table.
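
The ordering rule above can be sketched as a two-key sort. This is an illustration under assumptions: the `Candidate` record and `rank_candidates` function are hypothetical, and the identification compatibility score is treated as a number where higher is better, since the skill does not define its scale. Only deal-breaker datasets are routed to the rejection list here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Lower rank sorts first: A, then B, then C, then D.
GRADE_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3}


@dataclass
class Candidate:
    name: str
    grade: str                 # adjusted feasibility grade from the Explorer-Critic
    id_score: float            # identification compatibility (assumed numeric, higher is better)
    deal_breaker: bool = False


def rank_candidates(
    candidates: List[Candidate],
) -> Tuple[List[Candidate], List[Candidate]]:
    """Split off deal-breakers, then sort the rest by grade and score."""
    rejected = [c for c in candidates if c.deal_breaker]
    viable = [c for c in candidates if not c.deal_breaker]
    # Primary key: grade (A first). Secondary key: score, highest first.
    ranked = sorted(viable, key=lambda c: (GRADE_ORDER[c.grade], -c.id_score))
    return ranked, rejected
```

Python's `sorted` is stable, so candidates that tie on both grade and score keep the order in which the Explorer agents reported them.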

## Step 5: Save Report

Save to quality_reports/data_exploration_[sanitized_topic].md:

# Data Exploration: [Topic]

**Date:** [YYYY-MM-DD]
**Research question:** [one sentence]
**Empirical strategy:** [method]
**Variables sought:** Treatment = [X], Outcome = [Y], Controls = [list]


## Top Candidates (Grade A–B)

### 1. [Dataset Name] — Grade: A/B

**Provider:** [Name] | **Access:** [Public/Restricted/etc.] | **URL:** [link]

**Coverage:** [time period] | [geography] | [unit of observation] | N ≈ [size]

**Key Variables:**
- Treatment proxy: [variable]
- Outcome: [variable]
- Controls available: [list]

**Explorer-Critic Assessment:**
- Measurement validity: [1-2 sentences]
- Sample selection: [1-2 sentences]
- External validity: [1-2 sentences]
- Identification compatibility: [focused on the proposed strategy]
- Known issues: [specific documented problems]

**Bottom line:** [1-2 sentences — viable and under what conditions]


[Repeat for all A and B grade datasets]


## Accessible With Effort (Grade C)

[Brief summaries — name, access path, main limitation, why C not B]


## Rejection Table

| Dataset | Reason for Rejection | Deal-breaker? |
|---------|---------------------|---------------|
| [Name] | [Explorer-Critic finding] | YES/NO |


## Recommended Path Forward

1. **Best dataset:** [Name] — [one sentence why]
2. **Fallback if [best] unavailable:** [Name] — [why it's second choice]
3. **Access path for [best]:** [download URL, application URL, IRB requirements, restricted-data steps]

### Ingest Recipe for [best]

A copy-pasteable load-and-clean block for the recommended dataset. Tailor to whether the source is a flat file, an API/portal, or a non-tabular source (PDF, scraped HTML).

**R:**
```r
# Download (or log in and download; note in a comment if the step is manual)
library(tidyverse)
df <- readr::read_csv("data/raw/[file].csv")  # or haven::read_dta, arrow::read_parquet

df_clean <- df %>%
  rename(...) %>%
  mutate(...) %>%
  filter(...)

arrow::write_parquet(df_clean, "data/processed/[name].parquet")
```

**Python:**
```python
import pandas as pd
df = pd.read_csv("data/raw/[file].csv")  # or pd.read_stata, pd.read_parquet
df_clean = (
    df.rename(columns={...})
      .assign(...)
      .query("...")
)
df_clean.to_parquet("data/processed/[name].parquet")
```

If the source is a PDF, scraped HTML page, or government portal API, replace the load step with the appropriate extraction recipe — tabulizer/pdftools/rvest/tidycensus/fredr in R, or pdfplumber/camelot/pandas.read_html/census/fredapi in Python. The /data-analysis skill's Phase 0.5 documents the full set; cross-reference it if extraction is non-trivial.

### Known traps for [best]

- [E.g., FIPS codes read as integers lose their leading zeros; read them as strings]
- [E.g., Survey weights required: `srvyr::as_survey_design()`]
- [E.g., Top-coding on income: note the threshold]

## Next Steps

- `/data-analysis [dataset]`: begin analysis with the recommended dataset (Phase 0.5 handles non-tabular sources; Phase 3.5 runs design-specific identification diagnostics)
- `/lit-review [topic]`: check whether papers in the literature use these datasets (helps validate the choice)



## Important

- **Identification compatibility is the most important criterion.** A perfectly accessible dataset that can't support the proposed empirical strategy is useless. The Explorer-Critic's grade on this dimension should drive the recommendation.
- **Access level affects timeline.** An FSRDC dataset may take 1-2 years to access. A public download can start today. Make this tradeoff explicit.
- **Don't reject C-grade datasets outright.** An FSRDC dataset with perfect identification fit may be the right choice for a dissertation. Present the access path clearly.
