Skill

fulltext-retrieval

Batch downloads open-access PDFs from a DOI list using Unpaywall, PMC, OpenAlex, and Crossref APIs. Converts PDFs to Markdown for LLM analysis.

Python

data-engineering

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/medsci-project:fulltext-retrieval

User invocable

Model invocable

Inline context

Default effort

Configuration

Model: inherit

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Batch download open-access full-text PDFs from a DOI list using legitimate OA APIs only.

Supporting Files

fetch_oa.py, pdf_to_md.py, skill.yml

SKILL.md

175 lines · ~1.4k tokens

Stats

Language: Python

Stars: 148

Forks: 37

Maintenance: Excellent

Last Commit: Jun 14, 2026

Fulltext Retrieval Skill

Batch download open-access full-text PDFs from a DOI list using legitimate OA APIs only.

Pipeline

DOI list → Unpaywall → PMC (Europe PMC / OA FTP / web) → OpenAlex → Crossref → landing page

Each DOI is tried against these sources in order until one of them returns a valid PDF (at least 10 KB, starting with the %PDF- header).
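The fallback chain and the validity check can be sketched as follows. This is a minimal illustration, not the code in fetch_oa.py: the names `is_valid_pdf` and `first_valid_pdf` are hypothetical, each source is assumed to be a callable that takes a DOI and returns bytes or None, and "10 KB" is read as 10 × 1024 bytes.

```python
from typing import Callable, Iterable, Optional

MIN_PDF_BYTES = 10 * 1024  # assumption: "10 KB" means 10 * 1024 bytes


def is_valid_pdf(data: bytes) -> bool:
    """Return True if the payload is large enough and starts with %PDF-."""
    return len(data) >= MIN_PDF_BYTES and data.startswith(b"%PDF-")


def first_valid_pdf(
    doi: str,
    sources: Iterable[tuple[str, Callable[[str], Optional[bytes]]]],
) -> tuple[Optional[str], Optional[bytes]]:
    """Try each (name, fetcher) pair in order; stop at the first valid PDF."""
    for name, fetch in sources:
        try:
            data = fetch(doi)
        except OSError:
            # Network-level failure (urllib errors derive from OSError):
            # fall through to the next source.
            continue
        if data is not None and is_valid_pdf(data):
            return name, data
    return None, None
```

Checking the header as well as the size guards against HTML error pages or challenge pages that a server may return with a 200 status in place of the PDF.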

Quick Start

# Prepare a DOI list (one per line)
cat > dois.txt << 'EOF'
10.1007/s00330-010-1783-x
10.1002/mp.12524
10.1148/radiol.13131265
EOF

# Run
python fetch_oa.py dois.txt --output pdfs/ --email your.email@example.com

# Verbose mode for debugging
python fetch_oa.py dois.txt -o pdfs/ -e your.email@example.com --verbose

Input Formats

Plain text — one DOI per line:

10.1007/s00330-010-1783-x
10.1002/mp.12524

TSV with header — must contain a DOI column, optional PMID column:

ID	Title	DOI	PMID	Year
1	Some paper	10.1007/s00330-010-1783-x	20628747	2010

When a PMID is available, the PMC lookup is more reliable (PMID → PMCID conversion).
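A parser that accepts both input formats might look like the sketch below. It is an illustration under stated assumptions rather than the script's actual logic: `Entry` and `parse_doi_input` are hypothetical names, the table is detected by a DOI column in the first line, and column names are matched case-insensitively.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    doi: str
    pmid: Optional[str]


def parse_doi_input(text: str) -> list[Entry]:
    """Parse a plain DOI list or a tab-separated table with a DOI column."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [h.strip().upper() for h in lines[0].split("\t")]
    if "DOI" not in header:
        # Plain text: every non-empty line is a DOI.
        return [Entry(ln.strip(), None) for ln in lines]
    doi_idx = header.index("DOI")
    pmid_idx = header.index("PMID") if "PMID" in header else None
    entries = []
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("\t")]
        if doi_idx >= len(cells) or not cells[doi_idx]:
            continue  # row without a DOI: nothing to retrieve
        pmid = ""
        if pmid_idx is not None and pmid_idx < len(cells):
            pmid = cells[pmid_idx]
        entries.append(Entry(cells[doi_idx], pmid or None))
    return entries
```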

PMC Download (JS-Challenge Resistant)

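PMC web pages may block automated downloads with JavaScript proof-of-work challenges. This tool works around that with the three fallback methods named in the pipeline: the Europe PMC REST API, the PMC OA FTP service, and finally the PMC web page itself. The first two are shown here.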

Method A: Europe PMC REST API (most reliable)

PMCID="PMC9733600"
curl -sLo output.pdf \
  "https://europepmc.org/backend/ptpmcrender.fcgi?accid=${PMCID}&blobtype=pdf"

Method B: PMC OA FTP Service

curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=${PMCID}" | \
    grep -oE 'href="[^"]*\.pdf"' | head -1 | \
    sed 's/href="//;s/"//' | xargs curl -sLo output.pdf
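For readers who prefer Python to curl, the two methods can be approximated with the standard library. The sketch below only builds the URLs and extracts the PDF link, leaving the download to the caller. The function names are hypothetical, and the assumption that the OA service returns XML whose elements carry an `href` attribute is inferred from the grep pattern in Method B.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

EUROPE_PMC_RENDER = "https://europepmc.org/backend/ptpmcrender.fcgi"
PMC_OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def europe_pmc_pdf_url(pmcid: str) -> str:
    """Method A: URL of the Europe PMC rendered PDF for a PMCID."""
    return EUROPE_PMC_RENDER + "?" + urlencode({"accid": pmcid, "blobtype": "pdf"})


def pmc_oa_lookup_url(pmcid: str) -> str:
    """Method B, step 1: URL of the OA service record for a PMCID."""
    return PMC_OA_SERVICE + "?" + urlencode({"id": pmcid})


def first_pdf_href(oa_xml: str) -> Optional[str]:
    """Method B, step 2: first href ending in .pdf, or None if there is none."""
    try:
        root = ET.fromstring(oa_xml)
    except ET.ParseError:
        return None
    for element in root.iter():
        href = element.get("href", "")
        if href.endswith(".pdf"):
            return href
    return None
```

Returning None when no link is found lets the caller move on to the next method, whereas the shell pipeline above would invoke curl without a URL in that case.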

DOI/PMID → PMCID Conversion

# Works with both DOI and PMID; prints an empty line if no PMCID is found
DOI="10.1007/s00330-010-1783-x"
curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=${DOI}&format=json" | \
    python3 -c "import sys,json; print(json.load(sys.stdin)['records'][0].get('pmcid',''))"
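The same lookup can be written defensively in Python. This sketch assumes the response shape implied by the one-liner above (a `records` list whose first item may contain `pmcid`); `idconv_url` and `extract_pmcid` are hypothetical helper names.

```python
import json
from typing import Optional
from urllib.parse import urlencode

IDCONV_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"


def idconv_url(identifier: str) -> str:
    """Build the converter URL for a DOI or PMID, percent-encoding the ID."""
    return IDCONV_URL + "?" + urlencode({"ids": identifier, "format": "json"})


def extract_pmcid(payload: str) -> Optional[str]:
    """Return the PMCID from a converter JSON response, or None if absent."""
    try:
        body = json.loads(payload)
    except json.JSONDecodeError:
        return None
    records = body.get("records") if isinstance(body, dict) else None
    if not isinstance(records, list) or not records:
        return None
    first = records[0]
    if not isinstance(first, dict):
        return None
    return first.get("pmcid") or None
```

Unlike the shell one-liner, this version returns None instead of raising when the response has no `records` entry or is not valid JSON.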

Output

PDFs saved as {DOI_safe}.pdf (slashes replaced with underscores)
manual_needed.txt — DOIs that could not be retrieved via OA
Summary with OA/PMC/fail/skip counts
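The naming rule and the summary can be illustrated with two small helpers. Both names are hypothetical, and the status labels ("oa", "pmc", "fail", "skip") are assumed from the summary description rather than taken from the script.

```python
from collections import Counter


def doi_to_filename(doi: str) -> str:
    """Map a DOI to a PDF file name by replacing slashes with underscores."""
    return doi.strip().replace("/", "_") + ".pdf"


def summarize(results: dict[str, str]) -> tuple[Counter, list[str]]:
    """Return per-status counts and the DOIs that need manual retrieval.

    `results` maps each DOI to a status label; the second return value is
    what would be written to manual_needed.txt, one DOI per line.
    """
    counts = Counter(results.values())
    manual = [doi for doi, status in results.items() if status == "fail"]
    return counts, manual
```

For example, `doi_to_filename("10.1002/mp.12524")` gives `10.1002_mp.12524.pdf`.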

Requirements

Python 3.10+ (stdlib only, no pip dependencies)
Contact email (required by Unpaywall Terms of Service)

API Policies

Source	Rate Limit	Notes
Unpaywall	100k req/day	Email required
NCBI PMC	3 req/sec without API key	Add `&api_key=` for higher limits
OpenAlex	100k req/day	Polite pool with email in User-Agent
Crossref	50 req/sec with email	Polite pool with `mailto:` in UA
Europe PMC	No documented limit	Be polite, ≤1 req/sec recommended

The script uses 0.3–0.5 second delays between requests.
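One way to implement such a delay is a small throttle object that sleeps for a random interval before every request except the first. This is a sketch of the idea, not the script's code: the `Throttle` class is hypothetical, and drawing the delay uniformly from the 0.3 to 0.5 second range is an assumption.

```python
import random
import time
from typing import Callable


class Throttle:
    """Sleep a random 0.3 to 0.5 s before every request except the first."""

    def __init__(
        self,
        low: float = 0.3,
        high: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.low, self.high = low, high
        self._sleep = sleep  # injectable so tests do not have to wait
        self._first = True

    def wait(self) -> float:
        """Pause before the next request and return the delay that was used."""
        if self._first:
            self._first = False
            return 0.0
        delay = random.uniform(self.low, self.high)
        self._sleep(delay)
        return delay
```

At 0.3 seconds or more per request the client stays at or below roughly 3 requests per second, which matches the keyless NCBI limit in the table.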

PDF → Markdown Conversion (Optional)

After downloading PDFs, convert them to LLM-friendly Markdown for token-efficient repeated analysis. Uses pymupdf4llm — optimized for academic papers with two-column layout handling and table preservation.

Quick Start

# Install (one-time)
pip install pymupdf4llm

# Convert all PDFs in a directory
python pdf_to_md.py pdfs/

# Convert with verbose output
python pdf_to_md.py pdfs/ -v

# Custom output directory
python pdf_to_md.py pdfs/ -o markdown/

# First 10 pages only (useful for long supplements)
python pdf_to_md.py pdfs/ --pages 0-9

# Overwrite existing conversions
python pdf_to_md.py pdfs/ --force
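The `--pages 0-9` example implies 0-based, inclusive page ranges. A parser for that syntax could look like the sketch below; `parse_page_range` is a hypothetical name, and support for comma-separated parts such as `0-2,5` is an assumption that goes beyond what the examples show.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0-9" into a sorted list of 0-based page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # the end page is included
        else:
            pages.add(int(part))
    return sorted(pages)
```

The resulting list has the shape that pymupdf4llm's `to_markdown` accepts for its `pages` argument (a list of 0-based page numbers).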

Combined Workflow

# Step 1: Download PDFs
python fetch_oa.py dois.txt -o pdfs/ -e your.email@example.com

# Step 2: Convert to Markdown (only successful downloads)
python pdf_to_md.py pdfs/ -v

After conversion, .md files sit alongside .pdf files. Claude Code can then use Read for full content or Grep for targeted extraction — significantly more token-efficient than re-reading PDFs.
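Because the .md files sit next to the PDFs, deciding what still needs converting reduces to comparing file names. The sketch below assumes that existing conversions are skipped unless `--force` is given (inferred from the flag's description); `pending_conversions` is a hypothetical name.

```python
from pathlib import PurePath
from typing import Iterable


def pending_conversions(
    filenames: Iterable[str], force: bool = False
) -> list[tuple[str, str]]:
    """Return (pdf_name, md_name) pairs that still need converting.

    `filenames` is a directory listing. A PDF is skipped when a Markdown
    file with the same stem is already present, unless `force` is set.
    """
    names = set(filenames)
    pending = []
    for name in sorted(names):
        path = PurePath(name)
        if path.suffix.lower() != ".pdf":
            continue
        md_name = str(path.with_suffix(".md"))
        if force or md_name not in names:
            pending.append((name, md_name))
    return pending
```

A typical call would pass `os.listdir("pdfs/")` as `filenames`.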

When to Convert

Scenario	Recommendation
Screening/triage (read once)	Skip — read PDF directly
Data extraction from k≥5 studies	Convert — repeated reads save tokens
Meta-analysis full pipeline	Convert — papers referenced across multiple phases
Single paper deep review	Optional — marginal benefit

Academic Paper Defaults

Images: Skipped (saves tokens; figures referenced by caption text)
Tables: lines_strict strategy (preserves grid-line tables accurately)
Layout: Two-column academic layout handled automatically
Headers/footers: Removed by pymupdf4llm
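These defaults could be expressed as keyword arguments to pymupdf4llm's `to_markdown` function. The mapping below is an assumption about how pdf_to_md.py might configure the call, not a description of its code: `write_images` and `table_strategy` are pymupdf4llm parameters, while `ACADEMIC_DEFAULTS` and `convert_pdf` are hypothetical names.

```python
from typing import Optional

# Assumed mapping of the defaults above onto pymupdf4llm keyword arguments.
ACADEMIC_DEFAULTS = {
    "write_images": False,             # images skipped
    "table_strategy": "lines_strict",  # grid-line tables
}


def convert_pdf(pdf_path: str, pages: Optional[list[int]] = None) -> str:
    """Convert one PDF to Markdown text using the assumed defaults."""
    # Imported lazily so that this module loads without the optional dependency.
    import pymupdf4llm

    return pymupdf4llm.to_markdown(pdf_path, pages=pages, **ACADEMIC_DEFAULTS)
```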

Dependency Note

pdf_to_md.py requires pymupdf4llm (AGPL-3.0). This is an optional dependency — fetch_oa.py remains stdlib-only with zero external dependencies. The AGPL license applies to pymupdf4llm itself, not to this skill.

Limitations

Only retrieves open-access articles. Paywalled articles require institutional access.
Landing page scraping may fail on publisher-specific JavaScript-heavy pages.
Some recent articles may not yet be indexed by OA sources.
PDF→Markdown quality depends on the PDF's text layer. Scanned-only PDFs may produce poor output.

Anti-Hallucination

Never fabricate file paths, URLs, DOIs, or package names. Verify existence before recommending.
Never invent journal metadata, impact factors, or submission policies without verification at the journal's website.
If a tool, package, or resource does not exist or you are unsure, say so explicitly rather than guessing.
