download-ref

Downloads academic references (arXiv IDs or DOIs) into a sci-brain knowledge base: fetches metadata, PDFs, renders to markdown, updates INDEX.md and ref.bib.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/sci-brain:download-ref

Supporting Files

helpers/append_bibtex.py
helpers/bibtex_to_manifest.py
helpers/fetch_metadata.py
helpers/index.py
helpers/render.py
helpers/resolve_kb.py

SKILL.md

262 lines · ~2.7k tokens

Stats

Language: Python
Stars: 50
Forks: 8
Maintenance: Excellent
Last Commit: Jun 18, 2026

download-ref

When to use

A discussion / draft surfaces a paper not yet in the project KB, and you want it indexed for future search.
The user says "add this ref to the KB", "download arXiv:XXXX", "pull this DOI".
Bulk-importing a reading list from issue threads / chat history / a ref.bib.

Do NOT use:

For GitHub repos / web pages — those are too varied for a single-shot helper.

Preflight (run once per machine)

The renderer uses pymupdf4llm for the highest-fidelity output (it preserves figures). The fallbacks (markitdown, then pdftotext) are text-only, so figures are silently dropped. Verify before fetching:

python3 -c "import pymupdf4llm; print('ok', pymupdf4llm.__version__)"

If that errors, install for the same python3 the helpers will use:

# macOS / Homebrew Python
/opt/homebrew/bin/python3 -m pip install --user --break-system-packages pymupdf4llm

# Linux / system Python
python3 -m pip install --user pymupdf4llm

Inputs

One or more arXiv IDs (e.g. 1806.08734, 2006.10739) — strip the vN suffix.
One or more DOIs (e.g. 10.1103/PhysRevLett.130.036401) — lowercase preferred; renderer normalizes.
KB path — see Step 1.
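The input rules above (bare arXiv IDs without the vN suffix, lowercase DOIs) can be applied before a manifest is built. The following is a minimal sketch, not the skill's own code: `normalize_arxiv_id` and `normalize_doi` are hypothetical names, and the prefixes they strip (`arXiv:`, `doi:`, `https://doi.org/`) are assumptions about what users typically paste.

```python
import re

# Hypothetical helpers for illustration; they are not part of the shipped helpers/.
_ARXIV_PREFIX = re.compile(r"^arxiv:", re.IGNORECASE)
_ARXIV_VERSION = re.compile(r"v\d+$")
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_arxiv_id(raw: str) -> str:
    """Strip an optional 'arXiv:' prefix and a trailing version suffix (v1, v2, ...)."""
    bare = _ARXIV_PREFIX.sub("", raw.strip())
    return _ARXIV_VERSION.sub("", bare)


def normalize_doi(raw: str) -> str:
    """Strip a 'doi:' or doi.org URL prefix (assumed input forms) and lowercase the DOI."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


if __name__ == "__main__":
    print(normalize_arxiv_id("arXiv:1806.08734v3"))
    print(normalize_doi("https://doi.org/10.1103/PhysRevLett.130.036401"))
```

Normalizing up front keeps the manifest in the bare-ID form the helpers expect, whatever form the user typed.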

Files this skill owns vs. doesn't

download-ref writes:

$KB/.raw/{arxiv,doi}/<id>.{json,pdf}
$KB/.figures/{arxiv__<id>,doi__<safe>}/...
$KB/<id>_<slug>.md (rendered paper, one per ref)
$KB/INDEX.md (regenerated each run)
Appends entries to $(dirname $KB)/ref.bib

download-ref never touches:

$KB/NOTES.md — owned by survey / researchstyle / humans (sub-themes, open problems, bottlenecks).

Workflow

1. Resolve the KB

If the caller passes --kb <abs-path>, use that. Otherwise:

KB=$(python3 skills/download-ref/helpers/resolve_kb.py)
if [ -z "$KB" ]; then
  # resolve_kb printed "unresolvable from ..." to stderr and exited 2.
  # Ask the user via AskUserQuestion where the KB should live.
  exit 1
fi

For advisor flows (/incarnate, /brainstorm-ideas with a selected advisor), resolve the advisor KB instead: KB=$(python3 skills/download-ref/helpers/resolve_kb.py --advisor <slug>). This honors $SCIBRAIN_KB_DIRNAME the same way the project-KB form does.

2. Confirm the refs aren't already present

for id in 1806.08734 2006.10739; do
  [ -f "$KB/.raw/arxiv/$id.json" ] && echo "$id present" || echo "$id missing"
done
for doi in 10.1103/PhysRevLett.130.036401; do
  safe=$(echo "$doi" | tr '/' '-')
  [ -f "$KB/.raw/doi/$safe.json" ] && echo "$doi present" || echo "$doi missing"
done

Helpers are idempotent — this check is for human-readable status, not gating.

3. Build a manifest

3a. Direct input (single-shot mode):

TMP=/tmp/download-ref-manifest.json
cat > "$TMP" <<'EOF'
{"arxiv": ["1806.08734", "2006.10739"], "doi": []}
EOF
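When the IDs come from a conversation rather than a hand-typed heredoc, the same manifest can be written programmatically. This is a sketch under the assumption that the manifest is exactly the two-key JSON object shown above; `build_manifest` and `write_manifest` are hypothetical names, and the de-duplication is an added convenience, not documented helper behavior.

```python
import json
from pathlib import Path


def build_manifest(arxiv_ids: list[str], dois: list[str]) -> dict[str, list[str]]:
    """Return a manifest dict in the {"arxiv": [...], "doi": [...]} shape."""
    # dict.fromkeys keeps first-seen order while dropping repeated IDs.
    return {
        "arxiv": list(dict.fromkeys(arxiv_ids)),
        "doi": list(dict.fromkeys(dois)),
    }


def write_manifest(manifest: dict[str, list[str]], path: Path) -> None:
    """Serialize the manifest as a single JSON line."""
    path.write_text(json.dumps(manifest) + "\n", encoding="utf-8")


if __name__ == "__main__":
    manifest = build_manifest(["1806.08734", "2006.10739", "1806.08734"], [])
    write_manifest(manifest, Path("/tmp/download-ref-manifest.json"))
```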

3b. From an existing ref.bib (bulk mode, --from-bib):

TMP=/tmp/download-ref-manifest.json
python3 skills/download-ref/helpers/bibtex_to_manifest.py "$(dirname "$KB")/ref.bib" > "$TMP"

When in bulk mode, optionally ask the user:

"I see 59 refs in the manifest. Render all, topic-filtered, or specific IDs?"

(a) All — proceed with the full manifest

(b) Topic-filtered — name a heading from NOTES.md (skill greps for cite keys under it)

(c) Specific IDs — paste arXiv IDs / DOIs

For (b) and (c), edit $TMP accordingly before continuing.

4. Fetch metadata + arXiv PDFs

python3 skills/download-ref/helpers/fetch_metadata.py \
  --kb "$KB" \
  --manifest "$TMP" \
  --download-arxiv-pdfs

Populates $KB/.raw/{arxiv,doi}/<id>.{json,pdf} idempotently. PDFs are downloaded sequentially, with a 2 s sleep between requests to stay under arXiv rate limits. Each PDF is checked for a %%EOF trailer; truncated downloads are discarded and retried. For DOIs whose publisher gates the PDF (APS / Nature / IOP / AAAS / ACS), the helper falls back to the arXiv preprint via externalIds.ArXiv when present. If that also fails, the helper prints a miss line; go to Step 4b.
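The %%EOF trailer check mentioned above can be approximated as follows. This is a heuristic sketch, not the logic in fetch_metadata.py: the function names and the 1024-byte tail window are assumptions, and a file can pass this check and still be damaged in other ways.

```python
from pathlib import Path


def looks_complete_pdf(data: bytes, tail_bytes: int = 1024) -> bool:
    """Heuristic: content starts with '%PDF-' and has '%%EOF' near the end."""
    # tail_bytes is an assumed window; some writers pad a little after %%EOF.
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-tail_bytes:]


def pdf_file_is_complete(path: Path) -> bool:
    """Apply the same heuristic to a file on disk; a missing file counts as incomplete."""
    return path.is_file() and looks_complete_pdf(path.read_bytes())
```

A check like this also catches the common case where a publisher returns an HTML error page with a .pdf file name.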

Tip: Set SEMANTIC_SCHOLAR_API_KEY in your environment to raise the Semantic Scholar rate limit from ~1 req/s to 100 req/s. Get a free key at https://www.semanticscholar.org/product/api#api-key-form.

4b. SciHub fallback for paywalled PDFs

If Step 4 reports miss for any DOI (no open-access PDF and no arXiv preprint), use the sci-hub-server MCP tool. For each missing DOI:

Call mcp__sci-hub-server__get_paper_link with the DOI to get a direct PDF URL.
Call mcp__sci-hub-server__download_pdf and save to $KB/.raw/doi/<safe>.pdf (<safe> = DOI with / → -).
Verify the file exists and is > 1 KB.

If the MCP isn't configured, tell the user to add it:

"mcpServers": {
  "sci-hub-server": {
    "command": "npx",
    "args": ["sci-mcp-server"]
  }
}

Skip this step if all PDFs were fetched in Step 4.

5. Render PDF to markdown

python3 skills/download-ref/helpers/render.py --kb "$KB"

Add --only-missing to skip papers that already have a rendered .md file (>500 bytes). This is much faster when adding a few papers to a large KB:

python3 skills/download-ref/helpers/render.py --kb "$KB" --only-missing

No manifest is needed: the renderer auto-discovers .raw/{arxiv,doi}/*.json. Without --only-missing it renders new entries and overwrites existing ones.
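The --only-missing rule (skip a paper that already has a rendered .md larger than 500 bytes) amounts to a small file-size test. The sketch below assumes rendered files follow the `<id>_<slug>.md` naming at the KB root described earlier; `needs_render` is a hypothetical name and does not reproduce render.py.

```python
from pathlib import Path

MIN_RENDERED_BYTES = 500  # threshold quoted in the text above


def needs_render(kb: Path, file_id: str) -> bool:
    """Return True when no rendered <id>_<slug>.md above the threshold exists."""
    for md in kb.glob(f"{file_id}_*.md"):
        if md.stat().st_size > MIN_RENDERED_BYTES:
            return False
    # Either nothing matched, or only tiny (likely failed) renders were found.
    return True
```

The size threshold means a stub left behind by a failed render is treated as missing and gets re-rendered.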

PDF backend priority:

pymupdf4llm — markdown + extracted images into $KB/.figures/.
markitdown — text-only fallback.
pdftotext -layout — last-resort fallback.
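To see which of these backends a given machine would actually use, a probe along the following lines may help. It is a sketch under two assumptions: that pymupdf4llm and markitdown are importable Python modules under those names, and that pdftotext is found on PATH. `available_backends` is a hypothetical name, not a function in render.py.

```python
import importlib.util
import shutil

PRIORITY = ["pymupdf4llm", "markitdown", "pdftotext"]


def available_backends() -> list[str]:
    """Return the usable PDF backends, in the priority order listed above."""
    found = []
    # find_spec returns None for a top-level module that is not installed.
    for module in ("pymupdf4llm", "markitdown"):
        if importlib.util.find_spec(module) is not None:
            found.append(module)
    # pdftotext is an external binary, so look it up on PATH.
    if shutil.which("pdftotext") is not None:
        found.append("pdftotext")
    return found


if __name__ == "__main__":
    backends = available_backends()
    print("first choice:", backends[0] if backends else "none installed")
```

If the first entry is not pymupdf4llm, rendered papers will have no figures, which is the situation the Preflight step is meant to catch.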

.raw/ and .figures/ should stay out of git. Append to .gitignore if missing.

6. Propose + confirm cite key (per ref, single-shot mode only)

In single-shot mode (Step 3a), ask the user to confirm each new cite key. In bulk mode (Step 3b), the keys come from ref.bib directly — skip this step.

python3 skills/download-ref/helpers/append_bibtex.py propose \
  --kb "$KB" --id 1806.08734 --type arxiv

Output JSON has proposed_key (form lastname_year_firstkeyword), title, authors, year, bibtex_with_proposed_key. Show the user via AskUserQuestion:

Accept the proposed key
Use a custom key (free-text)
Skip this entry

Once confirmed:

python3 skills/download-ref/helpers/append_bibtex.py append \
  --kb "$KB" --id 1806.08734 --type arxiv \
  --key rahaman_2018_spectral \
  --bib "$(dirname "$KB")/ref.bib"

The helper rewrites the BibTeX cite key, refuses duplicates, appends with one blank-line separator.
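The lastname_year_firstkeyword form can be illustrated with a short sketch. It is not the implementation in append_bibtex.py: the stopword list, the "Given Family" author format, and the name `propose_key` are all assumptions made for this example.

```python
import re
import unicodedata

# Assumed stopword list; the helper's real list is not shown in this document.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def _ascii_slug(text: str) -> str:
    """Lowercase, strip accents, and drop anything that is not a-z or 0-9."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())


def propose_key(first_author: str, year: int, title: str) -> str:
    """Build lastname_year_firstkeyword from a 'Given Family' author string."""
    lastname = _ascii_slug(first_author.split()[-1])
    keyword = "untitled"  # fallback when every title word is a stopword
    for word in title.split():
        slug = _ascii_slug(word)
        if slug and slug not in STOPWORDS:
            keyword = slug
            break
    return f"{lastname}_{year}_{keyword}"


if __name__ == "__main__":
    # Illustrative metadata chosen to reproduce the key used in the append example.
    print(propose_key("Nasim Rahaman", 2018, "On the Spectral Bias of Neural Networks"))
```

Skipping leading stopwords is what makes the third component informative: without it, many keys would end in "on" or "the".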

7. Regenerate INDEX.md

python3 skills/download-ref/helpers/index.py \
  --kb "$KB" \
  --title "<project-or-advisor-slug> — references" \
  --source-note "Reading list and full-text harness."

Replace <project-or-advisor-slug> with this KB's name. Once chosen, keep --title and --source-note byte-identical across runs — INDEX.md is regenerated wholesale every time; drift causes noisy diffs.

8. Verify and report

# New md files appear at top level
ls -t "$KB"/*.md | head
# Frontmatter present
for f in "$KB"/*.md; do
  case "$(basename "$f")" in INDEX.md|NOTES.md) continue ;; esac
  head -1 "$f" | grep -q '^---$' || echo "MISSING FRONTMATTER: $f"
done
# Raw blobs gitignored
KB_NAME=$(basename "$KB")
git -C "$(dirname "$KB")" check-ignore "$KB_NAME/.raw/" 2>/dev/null \
  || echo "WARN: $KB_NAME/.raw/ not gitignored"
# INDEX picked up the new ids
for id in 1806.08734 2006.10739; do
  grep -q "$id" "$KB/INDEX.md" || echo "WARN: $id missing from INDEX.md"
done

Tell the user: new cite key(s), rendered file path(s), full_text yes/no per ref.

After download — transition checkpoint

After the done checklist passes, offer the next step:

"Papers downloaded and rendered. What next?"

(a) Write a review — invokes survey-writer to produce a technology assessment from the active KB

(b) Ideas — continue to brainstorming with /brainstorm-ideas

(c) Done — stop here

The natural pipeline is: /survey → /download-ref → /survey-writer.

Integration with other skills

/survey (upstream): writes/extends $KB/NOTES.md, appends to $(dirname $KB)/ref.bib, regenerates $KB/INDEX.md, then hands off to /download-ref to fetch PDFs and render full text. The survey's transition checkpoint offers this directly.
/survey-writer (downstream): consumes the rendered KB (full-text .md files + $(dirname $KB)/ref.bib) to produce a structured technology assessment report.
/survey / /researchstyle: write their own .raw/ JSON via batched fetches and call append_bibtex.py directly (skipping the per-ref confirmation in Step 6). They invoke index.py at the end of their run.
/brainstorm-ideas end-of-session: surfaces candidate IDs/DOIs from the conversation; for the user's selections, invokes /download-ref in single-shot mode.
/incarnate: invokes /download-ref (or /researchstyle) targeting the advisor KB resolved by python3 skills/download-ref/helpers/resolve_kb.py --advisor <slug>.

Common mistakes

| Mistake | Fix |
| --- | --- |
| Passing a relative `--kb` | Always pass an absolute path. Helpers don't `cd`; figures depend on absolute paths. |
| Forgetting `--download-arxiv-pdfs` in Step 4 | Without it, entries get `full_text: no` and Step 5 has nothing to render. |
| Using `arXiv:XXXX` with prefix or `vN` suffix | Strip both; the manifest takes bare ids: `1806.08734`. |
| Editing the rendered `.md` and losing it on re-render | The renderer overwrites without warning. Edit the `.raw/` source or the renderer logic instead. |
| Cite-key collision with different content | The helper skips silently. Investigate, then re-run propose with a different key. |
| Drifting `--title` / `--source-note` between runs | `INDEX.md` regenerates wholesale; first-run values are canonical. Copy them verbatim from the existing `INDEX.md`. |

Done checklist

.raw/{arxiv,doi}/<id>.json exists for every requested id
.raw/{arxiv,doi}/<id>.pdf exists where the source allows (else recorded as miss)
One new <id>_<slug>.md per ref at $KB/ root, with frontmatter
$KB/INDEX.md regenerated, lists each new entry
$(dirname $KB)/ref.bib has the new cite key (no duplicate)
User told cite keys, file names, and full_text yes/no per ref
