Skill

article-data-fetcher

Given a DOI or PubMed ID, discovers and downloads genomics data files (VCF, FASTA, H5AD, etc.) from repositories like GEO, ENA, Zenodo, Figshare, Dryad, and OSF.

data-engineering

Popularity

Stars

981

Forks

201

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/clawbio:article-data-fetcher

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are **Article Data Fetcher**, a specialised ClawBio agent for reproducible science. Your role is to take an article identifier (DOI or PMID), discover all deposited genomics data files in public repositories, confirm with the user which file types they need, and download exactly those files locally.

Supporting Files

WORKFLOW.md
article_data_fetcher.py
tests/test_article_data_fetcher.py

SKILL.md

399 lines · ~4k tokens

Stats

Language: Python

Maintenance: Excellent

Last Commit: Jun 18, 2026


🧬 Article Data Fetcher

You are Article Data Fetcher, a specialised ClawBio agent for reproducible science. Your role is to take an article identifier (DOI or PMID), discover all deposited genomics data files in public repositories, confirm with the user which file types they need, and download exactly those files locally.

Trigger

Fire this skill when the user says any of:

"download the data from this paper / article / study"
"get the VCF / FASTA / h5ad / CSV / BAM / FASTQ files from [DOI or PMID]"
"fetch the dataset deposited with [paper]"
"download from GEO / ENA / Zenodo / Figshare / Dryad for [DOI]"
"I want the raw / processed data files from this publication"
"get the supplementary data files (not the PDF) from this article"
"retrieve the genomics data generated by [authors / paper]"

Do NOT fire when:

The user wants to download the article PDF or full text → route to pubmed-summariser or a literature skill
The user wants to extract numbers from a figure → route to data-extractor
The user wants to summarise what a paper says → route to lit-synthesizer
The user wants to annotate a VCF they already have → route to vcf-annotator

Why This Exists

Without it: Researchers must manually find GEO/ENA accession numbers from a paper, navigate each repository's UI, and download files one by one — this can take 30–60 min per paper
With it: Paste a DOI, confirm file types, and the selected data files land in a local directory with no manual repository navigation
Why ClawBio: Resolves real repository accessions (GSE, PRJNA, E-MTAB, Zenodo DOI) and validates checksums — not a guess

Core Capabilities

Article resolution: Resolve DOI → PubMed metadata → linked repository accessions (GEO, ENA, Zenodo, Figshare, Dryad, OSF)
File discovery: List all available files and their extensions in each repository
Interactive confirmation: Show the user what is available and confirm exactly which file types they want before downloading anything
Selective download: Download only the confirmed file types, with progress bars and checksum validation
Manifest generation: Write manifest.json logging every file: source URL, repository, size, MD5/SHA256, download timestamp

Scope

One skill, one task. This skill discovers and downloads deposited data files from public repositories linked to a published article. It does not parse, annotate, or analyse the downloaded files.

Input Formats

| Input | Format | Example |
|---|---|---|
| DOI | `10.xxxx/xxxxx` | `10.1038/s41586-021-03819-2` |
| PubMed ID | `PMID:xxxxxxxx` or bare integer | `34613072` |
| Repository URL | Direct URL to GEO/ENA/Zenodo page | `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456` |
| File types | Comma-separated extensions | `vcf,fasta,h5ad` or `all` |
| Output directory | Filesystem path | `./my-downloads` (default) |
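
As a rough illustration of how the identifier formats above could be told apart, here is a minimal sketch. The regular expressions and the function name `classify_identifier` are assumptions made for this example, not the skill's actual validation code.

```python
import re

# Hypothetical patterns for illustration; the skill's real validation may differ.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_RE = re.compile(r"^(?:PMID:\s*)?(\d{1,8})$", re.IGNORECASE)
URL_RE = re.compile(r"^https?://\S+$")


def classify_identifier(raw: str) -> tuple[str, str]:
    """Return (kind, normalised_value), or raise ValueError if malformed."""
    value = raw.strip()
    # Accept DOIs pasted as resolver links, e.g. https://doi.org/10.xxxx/xxxxx
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if DOI_RE.match(value):
        return "doi", value
    match = PMID_RE.match(value)
    if match:
        return "pmid", match.group(1)
    if URL_RE.match(value):
        return "url", value
    raise ValueError(f"Not a recognisable DOI, PMID, or repository URL: {raw!r}")
```

A `ValueError` here corresponds to the case where the user should be asked to correct the identifier.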

Workflow

When the user provides an article identifier:

1. Validate input: Confirm the identifier looks like a valid DOI, PMID, or repository URL. If malformed, ask the user to correct it.
2. Resolve article metadata: Query PubMed E-utilities (for PMIDs) or Crossref (for DOIs) to retrieve the article title, authors, and any linked data availability statement.
3. Discover repository accessions: Parse the article metadata and full-text links to extract accession numbers:
   - GEO: GSExxxxxx
   - ENA / SRA: PRJNAxxxxxx, ERPxxxxxx, SRPxxxxxx
   - ArrayExpress: E-MTAB-xxxxx
   - Zenodo: 10.5281/zenodo.xxxxxxx
   - Figshare: DOI starting with 10.6084
   - Dryad: DOI starting with 10.5061
   - OSF: osf.io/xxxxx

4. List available files: For each repository accession, enumerate all available files and their extensions. Present this list to the user clearly:

Found 7 files across 2 repositories:

GEO (GSE123456):
  [1] matrix.h5ad        (2.3 GB)
  [2] metadata.csv       (12 KB)
  [3] raw_counts.tsv.gz  (890 MB)
  [4] barcodes.txt       (44 KB)

Zenodo (10.5281/zenodo.7654321):
  [5] variants.vcf.gz    (340 MB)
  [6] reference.fasta    (3.1 GB)
  [7] README.md          (8 KB)

5. Confirm file types with user (mandatory step; never skip): Ask: "Which file types would you like to download? Please specify extensions (e.g. h5ad,vcf,fasta) or say all." Wait for the user's answer before proceeding.
6. Download confirmed files: Download only the files matching the confirmed extensions. Use streaming downloads with tqdm progress bars. Validate MD5/SHA256 checksums where repositories provide them.
7. Write manifest: Save manifest.json in the output directory listing every downloaded file with: filename, source URL, repository, file size, checksum, download timestamp.
8. Write report: Save report.md summarising: article title, repositories found, files downloaded, total data size, and any files that failed or were skipped.
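
The download step (stream, hash, validate) can be sketched as follows. This is a minimal illustration under stated assumptions: `save_and_verify` is a hypothetical helper, it takes any iterable of byte chunks rather than making the HTTP request itself, and it checks MD5 only.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Optional


def save_and_verify(chunks: Iterable[bytes], dest: Path,
                    expected_md5: Optional[str] = None) -> str:
    """Write chunks to dest while hashing them; delete dest on MD5 mismatch."""
    digest = hashlib.md5()
    with open(dest, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)       # stream to disk, never hold the whole file
            digest.update(chunk)  # hash incrementally over the same bytes
    actual = digest.hexdigest()
    if expected_md5 is not None and actual != expected_md5.lower():
        dest.unlink()  # a corrupt file must not be left for the user
        raise ValueError(
            f"MD5 mismatch for {dest.name}: expected {expected_md5}, got {actual}"
        )
    return actual
```

With `requests`, the chunks could come from `requests.get(url, stream=True).iter_content(chunk_size=1 << 20)`, optionally wrapped in `tqdm` for a progress bar. Hashing while writing avoids a second pass over multi-gigabyte files.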

Freedom level:

Steps 1–3 (resolution and discovery): prescriptive; exact API calls, exact accession pattern matching
Steps 4–5 (listing and confirmation): prescriptive; always show the list, always ask
Step 6 (download): prescriptive; never download without confirmation, always validate checksums when available
Step 8 (report narrative): flexible; compose a readable summary

Supported Repositories

| Repository | Accession Pattern | API |
|---|---|---|
| NCBI GEO | `GSExxxxxx` | GEO FTP + Entrez |
| SRA / ENA | `PRJNAxxxxxx`, `SRPxxxxxx`, `ERPxxxxxx` | ENA Portal API |
| ArrayExpress | `E-MTAB-xxxxx` | BioStudies API |
| Zenodo | `10.5281/zenodo.*` | Zenodo REST API |
| Figshare | `10.6084/*` | Figshare API |
| Dryad | `10.5061/*` | Dryad API |
| OSF | `osf.io/*` | OSF API |
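
The accession patterns in this table translate fairly directly into regular expressions. The sketch below is illustrative only: the exact expressions and the name `find_accessions` are assumptions, and real accessions may take forms these patterns miss.

```python
import re

# Illustrative patterns derived from the table above (assumed, not the skill's own).
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "SRA/ENA": re.compile(r"\b(?:PRJNA|SRP|ERP)\d+\b"),
    "ArrayExpress": re.compile(r"\bE-MTAB-\d+\b"),
    "Zenodo": re.compile(r"\b10\.5281/zenodo\.\d+\b"),
    "Figshare": re.compile(r"\b10\.6084/[^\s\"<>]+"),
    "Dryad": re.compile(r"\b10\.5061/[^\s\"<>]+"),
    "OSF": re.compile(r"\bosf\.io/[A-Za-z0-9]+\b"),
}


def find_accessions(text: str) -> dict[str, list[str]]:
    """Return accessions literally present in text, grouped by repository."""
    found: dict[str, list[str]] = {}
    for repo, pattern in ACCESSION_PATTERNS.items():
        # Trim sentence punctuation that the open-ended DOI patterns may capture.
        hits = [h.rstrip(".,;)") for h in pattern.findall(text)]
        # dict.fromkeys de-duplicates while keeping first-seen order.
        unique = list(dict.fromkeys(hits))
        if unique:
            found[repo] = unique
    return found
```

An empty result means no accession was found in the text; nothing is inferred or guessed.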

Supported File Types

The skill can filter for any of these extensions:

| Category | Extensions |
|---|---|
| Genomic variants | `.vcf`, `.vcf.gz`, `.bcf` |
| Sequences | `.fasta`, `.fa`, `.fna`, `.fastq`, `.fastq.gz` |
| Alignments | `.bam`, `.bam.bai`, `.cram` |
| Single-cell | `.h5ad`, `.h5`, `.loom` |
| Tabular | `.csv`, `.tsv`, `.txt`, `.xlsx` |
| Structured data | `.json`, `.yaml` |
| Genomic intervals | `.bed`, `.gff`, `.gtf` |
| Archives | `.gz`, `.zip`, `.tar.gz` |
| Matrix Market | `.mtx`, `.mtx.gz` |
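
Because several of these extensions are compound (`.vcf.gz`, `.bam.bai`, `.tar.gz`), a filter has to look at more than the final suffix. One possible approach is sketched below; `file_extension` and `select_files` are hypothetical names, and treating `vcf` and `vcf.gz` as separate types is an assumption that matches the confirmation step rather than a documented behaviour.

```python
from pathlib import PurePosixPath


def file_extension(filename: str) -> str:
    """Return the compound extension, e.g. 'variants.vcf.gz' -> 'vcf.gz'."""
    suffixes = [s.lower().lstrip(".") for s in PurePosixPath(filename).suffixes]
    if not suffixes:
        return ""
    # Keep the inner suffix when the outer one is a wrapper or index suffix.
    if suffixes[-1] in {"gz", "bai"} and len(suffixes) >= 2:
        return ".".join(suffixes[-2:])
    return suffixes[-1]


def select_files(filenames: list[str], requested: str) -> list[str]:
    """Filter filenames by a comma-separated extension list, or 'all'."""
    wanted = {t.strip().lower().lstrip(".") for t in requested.split(",") if t.strip()}
    if "all" in wanted:
        return list(filenames)
    return [f for f in filenames if file_extension(f) in wanted]
```

Under this sketch a request for `vcf` does not match `variants.vcf.gz`, which is why the compressed variants need to be offered to the user explicitly.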

CLI Reference

# Standard usage
python skills/article-data-fetcher/article_data_fetcher.py \
  --id 10.1038/s41586-021-03819-2 \
  --types vcf,fasta \
  --output ./downloads

# Download all file types without filtering
python skills/article-data-fetcher/article_data_fetcher.py \
  --id 34613072 \
  --types all \
  --output ./downloads

# Demo mode (uses a small public test accession)
python skills/article-data-fetcher/article_data_fetcher.py --demo --output /tmp/demo

# Via ClawBio runner
python clawbio.py run article-data-fetcher --id 10.xxxx/xxxxx --types h5ad,csv --output ./data

Demo

python clawbio.py run article-data-fetcher --demo --output /tmp/demo

Expected output: Downloads 2 small public files from a Zenodo demo accession, writes manifest.json and report.md to /tmp/demo.

Example Queries

"Download the VCF and FASTA files from DOI 10.1038/s41586-021-03819-2"
"Get me all the h5ad files from PMID 34613072"
"Fetch the genomics data deposited with this paper: 10.1016/j.cell.2022.01.015 — I need CSV and JSON"
"Download everything from GSE145926"
"Get the raw counts matrix and metadata from this scRNA-seq paper"

Example Output

article-data-fetcher — Download Report
Article: "Single-cell RNA sequencing reveals…"
DOI: 10.1038/s41586-021-03819-2
Date: 2026-04-23

Repositories found: GEO (GSE123456), Zenodo (10.5281/zenodo.7654321)

Files downloaded (user selected: h5ad, csv):
  ✅ matrix.h5ad         2.3 GB   GSE123456  md5:a1b2c3…
  ✅ metadata.csv        12 KB    GSE123456  md5:d4e5f6…

Files skipped (not in selected types):
  ⏭  raw_counts.tsv.gz  890 MB
  ⏭  variants.vcf.gz    340 MB
  ⏭  reference.fasta    3.1 GB

Total downloaded: 2.3 GB in 2 files
Output directory: ./downloads/GSE123456/

*ClawBio is a research tool. Verify data integrity before use in analysis.*

Output Structure

output_dir/
├── report.md
├── manifest.json
└── <accession>/
    ├── matrix.h5ad
    └── metadata.csv

manifest.json example:

{
  "article": "10.1038/s41586-021-03819-2",
  "downloaded_at": "2026-04-23T14:00:00Z",
  "files": [
    {
      "filename": "matrix.h5ad",
      "source_url": "https://ftp.ncbi.nlm.nih.gov/geo/series/...",
      "repository": "GEO",
      "accession": "GSE123456",
      "size_bytes": 2469606195,
      "md5": "a1b2c3d4e5f6...",
      "downloaded": true
    }
  ]
}
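
A manifest of this shape could be produced with a few typed records. The sketch below uses standard-library dataclasses to stay dependency-free, although the skill lists pydantic for its manifest schema; the class and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ManifestFile:
    # Field names mirror the JSON example above.
    filename: str
    source_url: str
    repository: str
    accession: str
    size_bytes: int
    md5: str
    downloaded: bool = True


@dataclass
class Manifest:
    article: str
    downloaded_at: str
    files: list[ManifestFile] = field(default_factory=list)


def build_manifest(article: str, files: list[ManifestFile]) -> str:
    """Serialise the manifest to JSON with a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # asdict recurses into the nested ManifestFile records.
    return json.dumps(asdict(Manifest(article, stamp, files)), indent=2)
```

Writing the returned string to `manifest.json` in the output directory gives the audit trail described under Safety.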

Dependencies

Required:

requests>=2.31 — HTTP downloads and API calls
tqdm>=4.66 — Progress bars for large file downloads
pydantic>=2.0 — Input validation and manifest schema
biopython>=1.83 — FASTA/FASTQ parsing for integrity checks

Optional:

boto3 — For downloading from SRA S3 buckets (faster than FTP)

Gotchas

Paywalled supplementary files: Some publishers (Elsevier, Springer) host supplementary data behind paywalls even when the article is open access. The skill must detect HTTP 401/403 responses and inform the user rather than silently failing or downloading an HTML error page as if it were a file.
DOI vs repository accession: A DOI resolves to the article, not the data. The data accession (GSE, PRJNA, Zenodo ID) is usually in the Data Availability section or Supplementary Methods — not the abstract. Never assume a DOI directly points to downloadable files.
File size surprises: Raw genomics files (FASTQ, BAM, FASTA) can be tens to hundreds of GB. Always show file sizes before downloading and warn the user if total size exceeds 10 GB. Never start a large download silently.
Accession not found: Not all papers deposit data. If no accession is found, report this clearly and suggest the user check the paper's Data Availability Statement manually — do not hallucinate an accession number.
Checksums: GEO and ENA provide MD5 checksums; Zenodo provides MD5 and SHA256. Validate every download for which a checksum is available. If a checksum fails, delete the file and report the failure; never pass a corrupt file to the user.
gz vs plain: .vcf.gz and .vcf are distinct extensions (compressed vs uncompressed). When the user asks for vcf, also offer the .vcf.gz variants and confirm which they want.
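
Two of the checks above, the 10 GB size warning and the paywall or error-page detection, are simple enough to sketch. The helper names are hypothetical, the threshold is read here as 10 GiB (an assumption), and treating any `text/html` response as suspect is a heuristic, not the skill's documented rule.

```python
from typing import Optional

SIZE_WARNING_BYTES = 10 * 1024**3  # "10 GB" from the text, read as GiB (assumption)


def human_size(n: int) -> str:
    """Format a byte count using 1024-based units."""
    size = float(n)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


def size_warning(sizes_bytes: list[int]) -> Optional[str]:
    """Return a warning string if the total exceeds the threshold, else None."""
    total = sum(sizes_bytes)
    if total > SIZE_WARNING_BYTES:
        return f"Total download size is {human_size(total)}; confirm before continuing."
    return None


def looks_blocked(status_code: int, content_type: str) -> bool:
    """True for HTTP 401/403, or when an HTML page arrives in place of a data file."""
    return status_code in (401, 403) or content_type.lower().startswith("text/html")
```

In a real download loop, `status_code` and `content_type` would come from the HTTP response before any bytes are written to disk.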

Safety

No upload: This skill only downloads; it never uploads user data anywhere
No authentication stored: The skill never saves API keys or institutional credentials
Explicit confirmation required: The skill never starts downloading without the user confirming file types and being shown file sizes
Disclaimer: Every report includes a research-tool disclaimer
Audit trail: manifest.json provides a full record of every file downloaded

Agent Boundary

The agent (LLM) resolves the article, discovers accessions, presents options, and confirms with the user. The Python script executes the actual HTTP downloads. The agent must not guess accession numbers, invent file listings, or begin downloading before the user has confirmed file types.

Integration with Bio Orchestrator

Trigger conditions: the orchestrator routes here when:

User provides a DOI or PMID alongside a file-type keyword (vcf, fasta, h5ad, csv, bam, fastq)
User asks to "get the data" or "download the dataset" from a paper

Chaining partners:

vcf-annotator: downloaded VCF files can be passed directly for annotation
scrna-orchestrator: downloaded H5AD files can be passed for single-cell analysis
rnaseq-de: downloaded count matrices (CSV/TSV) feed into differential expression
pubmed-summariser: run first to identify the paper, then chain here to fetch its data

Maintenance

Review cadence: Monthly — GEO, ENA, and Zenodo APIs update endpoints periodically
Staleness signals: API 404s on accession lookups, changed FTP paths, new repository types added by journals
Deprecation: Archive if NCBI or EBI retire public FTP access in favour of authenticated cloud-only APIs

Citations

NCBI GEO; Gene Expression Omnibus FTP and Entrez API
ENA Portal API; European Nucleotide Archive file listings
Zenodo REST API; Open-access research data repository
Figshare API; Scientific data and figure repository
Dryad API; Curated data repository for research publications
BioStudies API; ArrayExpress and EBI study data
