From paper-trail
Acquire PDFs for bibliographic references via a strict 10-source cascade (Crossref OA → arXiv → OpenAlex → Unpaywall → HAL → CORE → archive.org → WebSearch queue; optionally Sci-Hub + Anna's Archive in opt-in mode). Each acquired PDF is validated against expected author/title/year (page 1 anti-homonymy) before being accepted. Trigger this skill whenever the user wants to download a PDF for a reference, fill a cascade, retry an acquisition, or push a `candidate`/`uid_resolved` ref forward in the FSM. Use also for `/paper-trail:cascade <slug>` and `/paper-trail:reactivate-ocr`. Triggers on French and English phrases: "télécharge le PDF", "lance la cascade", "acquérir les sources", "DL ce papier", "passer en pdf_acquired", "valider page 1", "reprise OCR", "download this paper", "acquire PDFs", "run cascade", "retry acquisition", "advance candidates". The skill never decides whether a citation is truthful — that is the curator's role (sota-auditor skill). It only executes the technical state transitions of the worker B.
Slash command: /paper-trail:pdf-cascade
Wraps the paper-trail worker B's acquisition cascade. Given a single
reference slug or a state filter, it advances the matching refs from
candidate toward page1_validated through the FSM, with strict page 1
anti-homonymy validation.
Anchors all downloads in the local registry (pdf_path, pdf_sha256,
acquisition_attempts[]) so the curator can audit everything.
Trigger this skill for any of:
- pushing candidate or uid_resolved refs forward
- running /paper-trail:cascade, /paper-trail:reactivate-ocr, or /paper-trail:status
- a sota-writer sub-task needs PDFs acquired for its proposed candidates
Do NOT invoke for semantic decisions (is this citation correct?); that
belongs to sota-auditor.
The skill delegates to the worker B Python CLI:
# Single ref by slug
python -m pipeline run --ref <slug>
# Batch by state filter
python -m pipeline run --state candidate --limit 50
# Dry-run (no mutation)
python -m pipeline run --state candidate --dry-run
# Reactivate refs waiting for OCR
python -m pipeline reactivate-ocr
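The invocations above can also be assembled programmatically. The following is a minimal sketch, not part of paper-trail: `build_pipeline_argv` is a hypothetical helper, it uses only the flags shown above (`--ref`, `--state`, `--limit`, `--dry-run`), and it assumes that a slug and a state filter are mutually exclusive and that `--limit` only makes sense with a state filter.

```python
import sys
from typing import List, Optional


def build_pipeline_argv(
    ref: Optional[str] = None,
    state: Optional[str] = None,
    limit: Optional[int] = None,
    dry_run: bool = False,
) -> List[str]:
    """Build the argv for `python -m pipeline run` (hypothetical helper)."""
    # Assumption: exactly one selector, either a single slug or a state filter.
    if (ref is None) == (state is None):
        raise ValueError("pass exactly one of ref= or state=")
    argv = [sys.executable, "-m", "pipeline", "run"]
    if ref is not None:
        argv += ["--ref", ref]
    else:
        argv += ["--state", state]
        # Assumption: --limit is only meaningful for batch runs.
        if limit is not None:
            argv += ["--limit", str(limit)]
    if dry_run:
        argv.append("--dry-run")
    return argv


if __name__ == "__main__":
    print(build_pipeline_argv(state="candidate", limit=50, dry_run=True))
```

The returned list can be handed to `subprocess.run(argv, check=True)` from the directory where the `pipeline` module is importable.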
The CLI invokes the 8-source cascade (or 10 sources if
RESEARCH_ENABLE_SHADOW_LIBS=1 — see DISCLAIMER.md). Each acquired
PDF must pass page 1 validation (author + title similarity ≥ 0.3 +
zero off-domain keywords) before being accepted into the registry.
1. Crossref OA (DOI-based, open-access metadata)
2. arXiv (preprints CS/math/physics/q-bio/q-fin/etc.)
3. OpenAlex (cross-domain academic graph)
4. Unpaywall (OA discovery, fallback)
5. HAL (Hyper Articles en Ligne, French academia)
6. CORE (UK-based open repository aggregator)
7. archive.org (digitized books and articles)
8. WebSearch queue (manual fallback — adds the ref to a queue for
human-driven search via Claude Code interactive)
If shadow libs are activated, sources 8 and 9 are inserted before WebSearch:
8. scihub_optin (Sci-Hub multi-mirror)
9. annas_archive_optin (Anna's Archive via scidb DOI + title search)
10. websearch
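The ordering rule above (shadow libraries slot in just before the WebSearch fallback) can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: `cascade_order` is a hypothetical function, and the identifiers for sources 1 to 7 are made up for the example. Only `scihub_optin`, `annas_archive_optin`, `websearch`, and the `RESEARCH_ENABLE_SHADOW_LIBS` variable come from the text.

```python
import os
from typing import List, Mapping

# Illustrative identifiers for sources 1-7 (assumed names, not from the pipeline).
BASE_SOURCES = [
    "crossref_oa", "arxiv", "openalex", "unpaywall", "hal", "core", "archive_org",
]
# Names taken from the list above.
SHADOW_SOURCES = ["scihub_optin", "annas_archive_optin"]
FALLBACK = "websearch"


def cascade_order(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the source order: 8 sources by default, 10 in opt-in mode."""
    sources = list(BASE_SOURCES)
    # Opt-in mode inserts the shadow libraries before the WebSearch fallback.
    if env.get("RESEARCH_ENABLE_SHADOW_LIBS") == "1":
        sources += SHADOW_SOURCES
    sources.append(FALLBACK)
    return sources


if __name__ == "__main__":
    for position, name in enumerate(cascade_order(), start=1):
        print(position, name)
```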
Every successful PDF download passes through _save_and_validate, which:
- extracts the page 1 text with pdftotext
- sets state: page1_validated if all 3 checks pass; pdf_acquired if the PDF
structure is OK but the text is not extractable (likely a scan, which will
trigger OCR via awaiting_rtfm_ocr); quarantines the PDF if validation fails
Quarantined PDFs go to _registry/_quarantine/<slug>_HOMONYM_*.pdf,
with the suffix indicating the failure mode.
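To make the three page 1 checks concrete, here is a minimal sketch of the decision. It rests on several assumptions: the text does not say which similarity metric the pipeline uses, so this example substitutes a simple token-overlap ratio; `page1_verdict` and its parameters are hypothetical; and the `"quarantined"` return value is just a label for the quarantine outcome. The 0.3 threshold and the state names `page1_validated` and `pdf_acquired` come from the text.

```python
import re
from typing import Iterable, Optional, Set

TITLE_THRESHOLD = 0.3  # threshold stated in the text


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens (Unicode-aware)."""
    return set(re.findall(r"\w+", text.lower()))


def page1_verdict(
    page1_text: Optional[str],
    author_surname: str,
    title: str,
    off_domain: Iterable[str],
) -> str:
    # No extractable text (likely a scan): keep the PDF, defer to OCR.
    if not page1_text or not page1_text.strip():
        return "pdf_acquired"
    page = _tokens(page1_text)

    # Check 1: the expected author surname appears on page 1.
    author_tokens = _tokens(author_surname)
    author_ok = bool(author_tokens) and author_tokens <= page

    # Check 2: title similarity (assumed metric: share of title tokens found).
    title_tokens = _tokens(title)
    overlap = len(title_tokens & page) / len(title_tokens) if title_tokens else 0.0
    title_ok = overlap >= TITLE_THRESHOLD

    # Check 3: zero off-domain keywords on page 1.
    domain_ok = True
    for keyword in off_domain:
        keyword_tokens = _tokens(keyword)
        if keyword_tokens and keyword_tokens <= page:
            domain_ok = False
            break

    return "page1_validated" if (author_ok and title_ok and domain_ok) else "quarantined"
```

A token-overlap ratio is deliberately crude; it is used here only because it is easy to reason about against a fixed threshold.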
The CLI prints a session recap:
Récap session : planned=N done=N pending=N blocked=N skipped_terminal=N
The doctor then runs at the end (unless --no-doctor is passed) to flag
any invariant violation the run introduced.
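If the session recap needs to be consumed by a script, it can be parsed with a small regular expression. This is a sketch under assumptions: `parse_recap` is a hypothetical helper, and it assumes each `N` is a non-negative integer and that the line keeps the `key=value` layout shown above.

```python
import re
from typing import Dict

RECAP_PREFIX = "Récap session :"  # literal prefix printed by the CLI


def parse_recap(line: str) -> Dict[str, int]:
    """Turn a recap line into a {counter_name: count} mapping."""
    if not line.strip().startswith(RECAP_PREFIX):
        raise ValueError("not a session recap line")
    # Each counter is printed as name=integer, separated by spaces.
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}


if __name__ == "__main__":
    sample = "Récap session : planned=3 done=2 pending=1 blocked=0 skipped_terminal=0"
    print(parse_recap(sample))
```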
For each ref processed, an entry is appended to the acquisition_attempts[]
field in its registry file, providing a complete audit trail.
User: "télécharge le PDF de arnold_1982" (download the PDF for arnold_1982)
Skill: invokes python -m pipeline run --ref arnold_1982 -v
User: "lance la cascade sur les 30 prochaines candidates" (run the cascade on the next 30 candidates)
Skill: invokes python -m pipeline run --state candidate --limit 30
User: "qu'est-ce qui se passerait si je lançais sur tous les uid_resolved ?" (what would happen if I ran it on all uid_resolved refs?)
Skill: invokes python -m pipeline run --state uid_resolved --dry-run
User: "reprise OCR sur les awaiting_rtfm_ocr" (resume OCR on the awaiting_rtfm_ocr refs)
Skill: invokes python -m pipeline reactivate-ocr
- cascade_exhausted: all sources failed → blocked_human:cascade_exhausted;
the ref needs human-driven action (e.g., contacting the author, institutional
access)
- title_mismatch: the downloaded PDF doesn't match the expected metadata →
quarantined + blocked_human:title_mismatch
- breaker_open: a source had ≥ 5 consecutive failures and is
temporarily disabled for this session (Couche 2, i.e. Layer 2, circuit-breaker)
- worker_crash: exception in a source's helper → logged in the journal,
ref left in its previous state
Each is logged in acquisition_attempts[] with verdict and reason.
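The breaker_open behaviour can be illustrated with a small per-session counter. This is a sketch, not the pipeline's implementation: `SessionBreaker` and its methods are hypothetical names, and it assumes a success resets the consecutive-failure count while an opened breaker stays open until the session ends. The threshold of 5 comes from the text.

```python
from collections import defaultdict
from typing import DefaultDict, Set

FAILURE_THRESHOLD = 5  # from the text: >= 5 consecutive failures


class SessionBreaker:
    """Disable a source for the rest of the session after repeated failures."""

    def __init__(self, threshold: int = FAILURE_THRESHOLD) -> None:
        self.threshold = threshold
        self._consecutive: DefaultDict[str, int] = defaultdict(int)
        self._open: Set[str] = set()

    def record(self, source: str, ok: bool) -> None:
        if ok:
            # A success breaks the streak but does not re-enable an open source.
            self._consecutive[source] = 0
            return
        self._consecutive[source] += 1
        if self._consecutive[source] >= self.threshold:
            self._open.add(source)

    def is_open(self, source: str) -> bool:
        return source in self._open


if __name__ == "__main__":
    breaker = SessionBreaker()
    for _ in range(5):
        breaker.record("hal", ok=False)
    print(breaker.is_open("hal"), breaker.is_open("core"))
```

A cascade loop would call `is_open(source)` before trying a source and skip it when the breaker is open.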
When a ref ends in blocked_human:cascade_exhausted_needs_manual, the
pipeline writes _registry/_hints/<slug>.md listing what was tried.
Don't grab the PDF outside the pipeline and dump it manually:
that breaks acquisition_attempts tracking, page 1 validation, and
quarantine. Use one of three proper entry points:
If a quick web search surfaces a working URL (HAL page, university deposit, NIME proceedings, author homepage), inject it via the slash command:
/paper-trail:inject-url <slug> https://example.edu/path/to/paper
This sets oa_url: in the ref frontmatter, unblocks the ref, and re-runs
the cascade. The manual_oa_url source sits at the head of the cascade
and benefits from the landing→PDF resolver, so a deposit page URL
works; there is no need for the direct PDF URL.
paper-search MCP download_with_fallback
For paywalled-but-on-RG / scidb references, the paper-search MCP
handles ResearchGate, Sci-Hub fallback, and other anti-bot sites that
our internal cascade can't reach:
# Inside Claude Code, not bash:
mcp__paper-search__download_with_fallback(
    title="<title>",
    doi="<doi>",
    save_dir="/tmp/paper-trail-recovered",
)
Then move the returned PDF to 10_SOURCES/<domain>/Sources/ and set
the ref's pdf_path to its relative path. The next pipeline run --ref <slug> will pick it up via the local-PDF step and apply page 1
validation.
If the PDF is already on disk, just place it in 10_SOURCES/<domain>/Sources/
and edit the ref's frontmatter:
pdf_path: 11_Biblio_MIR/Sources/Author_Year_Title.pdf
Then python3 -m pipeline run --ref <slug> will validate page 1.
- Don't set state: page1_validated by hand in the frontmatter; homonymies will leak in.
- Don't call _check_local_pdf from custom Python code; use the
three entry points above.