From medsci-project
Batch downloads open-access PDFs from a DOI list using Unpaywall, PMC, OpenAlex, and Crossref APIs. Converts PDFs to Markdown for LLM analysis.
Slash command: /medsci-project:fulltext-retrieval
Batch download open-access full-text PDFs from a DOI list using legitimate OA APIs only.
DOI list → Unpaywall → PMC (Europe PMC / OA FTP / web) → OpenAlex → Crossref → landing page
Each DOI goes through these sources in order until a valid PDF (≥10 KB, %PDF- header) is found.
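The cascade and validity check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in fetch_oa.py: the names `looks_like_pdf` and `first_valid_pdf` are hypothetical, each source is assumed to be a callable that takes a DOI and returns raw bytes or None, and 10 KB is assumed to mean 10 × 1024 bytes.

```python
from typing import Callable, Iterable, Optional

MIN_PDF_BYTES = 10 * 1024  # assumption: "10 KB" interpreted as 10 * 1024 bytes
PDF_MAGIC = b"%PDF-"       # header required by the validity check


def looks_like_pdf(data: bytes) -> bool:
    """Return True when the payload is at least 10 KB and starts with %PDF-."""
    return len(data) >= MIN_PDF_BYTES and data.startswith(PDF_MAGIC)


def first_valid_pdf(
    doi: str,
    sources: Iterable[Callable[[str], Optional[bytes]]],
) -> Optional[bytes]:
    """Try each source in order and return the first payload that validates."""
    for fetch in sources:
        try:
            data = fetch(doi)
        except OSError:
            # Network failures (urllib errors subclass OSError) move on to the next source.
            continue
        if data and looks_like_pdf(data):
            return data
    return None  # caller would record the DOI as needing manual retrieval
```

A source that returns an HTML block page instead of a PDF fails the header check, so the loop simply continues to the next source.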
# Prepare a DOI list (one per line)
cat > dois.txt << 'EOF'
10.1007/s00330-010-1783-x
10.1002/mp.12524
10.1148/radiol.13131265
EOF
# Run
python fetch_oa.py dois.txt --output pdfs/ --email you@example.com
# Verbose mode for debugging
python fetch_oa.py dois.txt -o pdfs/ -e you@example.com --verbose
Plain text — one DOI per line:
10.1007/s00330-010-1783-x
10.1002/mp.12524
TSV with header — must contain a DOI column, optional PMID column:
ID Title DOI PMID Year
1 Some paper 10.1007/s00330-010-1783-x 20628747 2010
When a PMID is available, the PMC lookup is more reliable (PMID → PMCID conversion).
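One way to accept both input formats is to look for a DOI column in the first line and fall back to plain text otherwise. The sketch below is an assumption about how such parsing could work, not the parser used by fetch_oa.py; `parse_doi_list` is a hypothetical name, and column matching is assumed to be case-insensitive.

```python
import csv
import io
from typing import List, Optional, Tuple


def parse_doi_list(text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (doi, pmid) pairs from plain-text or TSV-with-header input."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [h.strip().lower() for h in lines[0].split("\t")]
    if "doi" not in header:
        # Plain text: one DOI per line, no PMID available.
        return [(ln.strip(), None) for ln in lines]
    reader = csv.DictReader(
        io.StringIO("\n".join(lines)), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pairs: List[Tuple[str, Optional[str]]] = []
    for row in reader:
        # Normalize column names; skip malformed cells (missing or extra fields).
        cells = {
            k.strip().lower(): v.strip()
            for k, v in row.items()
            if isinstance(k, str) and isinstance(v, str)
        }
        doi = cells.get("doi", "")
        if doi:
            pairs.append((doi, cells.get("pmid") or None))
    return pairs
```

Returning the PMID alongside the DOI lets a later step prefer the PMID for the PMCID lookup when it is present.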
PMC web pages may block automated downloads with JavaScript proof-of-work challenges. This tool uses three fallback methods:
# Europe PMC render endpoint
PMCID="PMC9733600"
curl -sLo output.pdf \
"https://europepmc.org/backend/ptpmcrender.fcgi?accid=${PMCID}&blobtype=pdf"
# PMC OA service: download the first .pdf link found in the response
curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=${PMCID}" | \
grep -oE 'href="[^"]*\.pdf"' | head -1 | \
sed 's/href="//;s/"//' | xargs curl -sLo output.pdf
# Look up the PMCID with the NCBI ID Converter (works with both DOI and PMID)
DOI="10.1007/s00330-010-1783-x"
curl -s "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=${DOI}&format=json" | \
python3 -c "import sys,json; print(json.load(sys.stdin)['records'][0].get('pmcid',''))"
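The same lookup can be expressed in Python without touching the network. The sketch below only parses an ID Converter response and builds the Europe PMC URL used in the curl commands above; the function names are hypothetical, and the response shape (a `records` list whose first entry may carry a `pmcid` key) is taken from the one-liner above rather than from a full schema.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Endpoint copied from the curl example above.
EUROPE_PMC_RENDER = "https://europepmc.org/backend/ptpmcrender.fcgi"


def pmcid_from_idconv(payload: str) -> Optional[str]:
    """Extract the PMCID from an ID Converter JSON response, or None if absent."""
    records = json.loads(payload).get("records") or []
    if not records:
        return None
    return records[0].get("pmcid") or None


def europe_pmc_pdf_url(pmcid: str) -> str:
    """Build the Europe PMC render URL that serves the PDF for a PMCID."""
    query = urlencode({"accid": pmcid, "blobtype": "pdf"})
    return f"{EUROPE_PMC_RENDER}?{query}"
```

Unlike the shell one-liner, this version returns None instead of raising when the response contains no records.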
{DOI_safe}.pdf (slashes replaced with underscores)
manual_needed.txt: DOIs that could not be retrieved via OA
| Source | Rate Limit | Notes |
|---|---|---|
| Unpaywall | 100k req/day | Email required |
| NCBI PMC | 3 req/sec without API key | Add &api_key= for higher limits |
| OpenAlex | 100k req/day | Polite pool with email in User-Agent |
| Crossref | 50 req/sec with email | Polite pool with mailto: in UA |
| Europe PMC | No documented limit | Be polite, ≤1 req/sec recommended |
The script uses 0.3–0.5 second delays between requests.
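A delay of this kind might look like the sketch below. It assumes the pause is drawn uniformly at random from the 0.3 to 0.5 second range, which the text does not confirm; `polite_pause` is a hypothetical name, and the `sleep` parameter exists only so the function can be exercised without actually waiting.

```python
import random
import time
from typing import Callable


def polite_pause(
    low: float = 0.3,
    high: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random interval in [low, high] seconds and return it."""
    delay = random.uniform(low, high)  # assumption: uniform jitter within the range
    sleep(delay)
    return delay
```

Calling `polite_pause()` between consecutive API requests keeps the request rate to a few per second at most.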
After downloading PDFs, convert them to LLM-friendly Markdown for token-efficient repeated analysis. The conversion uses pymupdf4llm, which is optimized for academic papers: it handles two-column layouts and preserves tables.
# Install (one-time)
pip install pymupdf4llm
# Convert all PDFs in a directory
python pdf_to_md.py pdfs/
# Convert with verbose output
python pdf_to_md.py pdfs/ -v
# Custom output directory
python pdf_to_md.py pdfs/ -o markdown/
# First 10 pages only (useful for long supplements)
python pdf_to_md.py pdfs/ --pages 0-9
# Overwrite existing conversions
python pdf_to_md.py pdfs/ --force
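The `--pages 0-9` form suggests 0-based, inclusive page ranges. A parser for that syntax could be sketched as below; `parse_page_range` is a hypothetical helper, and support for comma-separated parts such as `0-2,5` is an assumption that goes beyond what the examples above show.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec such as '0-9' into a sorted list of 0-based page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"descending page range: {part!r}")
            pages.update(range(start, end + 1))  # both ends inclusive
        else:
            pages.add(int(part))
    return sorted(pages)
```

The resulting list has the shape that `pymupdf4llm.to_markdown` accepts for its `pages` argument (a list of 0-based page numbers), so it could be passed through directly.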
# Step 1: Download PDFs
python fetch_oa.py dois.txt -o pdfs/ -e you@example.com
# Step 2: Convert to Markdown (only successful downloads)
python pdf_to_md.py pdfs/ -v
After conversion, .md files sit alongside .pdf files. Claude Code can then use Read for full content or Grep for targeted extraction — significantly more token-efficient than re-reading PDFs.
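Because each .md file sits next to its .pdf, deciding what still needs converting reduces to a filename check. The sketch below illustrates that idea and how a `--force` style flag could override it; `pdfs_needing_conversion` is a hypothetical name and this is not necessarily how pdf_to_md.py selects its inputs.

```python
from pathlib import Path
from typing import List


def pdfs_needing_conversion(pdf_dir: Path, force: bool = False) -> List[Path]:
    """List PDFs in pdf_dir that lack a sibling .md file (all PDFs if force)."""
    pending = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        markdown = pdf.with_suffix(".md")  # e.g. paper.pdf -> paper.md
        if force or not markdown.exists():
            pending.append(pdf)
    return pending
```

Running it a second time after a successful conversion returns an empty list, which is what makes repeated runs of the pipeline cheap.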
| Scenario | Recommendation |
|---|---|
| Screening/triage (read once) | Skip — read PDF directly |
| Data extraction from k≥5 studies | Convert — repeated reads save tokens |
| Meta-analysis full pipeline | Convert — papers referenced across multiple phases |
| Single paper deep review | Optional — marginal benefit |
lines_strict strategy (preserves grid-line tables accurately)
pdf_to_md.py requires pymupdf4llm (AGPL-3.0). This is an optional dependency: fetch_oa.py remains stdlib-only with zero external dependencies. The AGPL license applies to pymupdf4llm itself, not to this skill.
npx claudepluginhub aperivue/medsci-skills --plugin medsci-project