Skill

paper-extract

Read a downloaded quant finance paper PDF and author note.md + metrics.json using judgment. Use when the user wants a TL;DR that captures the paper's contribution (not a copy-pasted abstract), reported performance metrics (Sharpe, max drawdown, annual return, volatility, Calmar, information ratio, win rate) validated against table values as well as narrative prose, the paper's core formulas verified by cross-checking Results-section citations, and paper-specific open questions needed for replication. Second stage of the paper-to-production pipeline; reads papers/<arxiv-id>/paper.pdf and writes note.md + metrics.json into the same directory. Bundled Python scripts prepare a structured context bundle that Claude then reasons over to produce the outputs.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/quant-paper-agent:paper-extract

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Turn a quant paper PDF into `note.md` (human synopsis) and `metrics.json` (machine hand-off for `paper-replicate`). Bundled Python scripts prepare a structured context bundle; Claude reads it and synthesizes the outputs — so the TL;DR captures the paper's contribution, metric extraction covers table values as well as narrative, and the replication open-questions are paper-specific.

SKILL.md

139 lines · ~2.2k tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Apr 29, 2026


paper-extract

Turn a quant paper PDF into note.md (human synopsis) and metrics.json (machine hand-off for paper-replicate). Bundled Python scripts prepare a structured context bundle; Claude reads it and synthesizes the outputs — so the TL;DR captures the paper's contribution, metric extraction covers table values as well as narrative, and the replication open-questions are paper-specific.

Workflow

Prepare (Python) — prepare_context.py extracts per-section text, candidate formulas (heuristic), candidate metric hits (regex), and meta into a single context.json. Bounded by character caps so the bundle fits comfortably in your context window.
Synthesize (Claude) — You (the main agent) read context.json and author note.md + metrics.json directly via the Write tool, using the schema below. The Python candidates are recall hints, not final output — verify, augment, drop false positives, and fix obvious OCR damage on formulas.
Validate (Python) — validate_output.py schema-checks metrics.json so paper-replicate never gets malformed input.

When to use

A PDF exists at papers/<arxiv-id>/paper.pdf (from paper-search or user-supplied) and the user wants it parsed into structured fields.
The user asks for reported performance numbers, core formulas, or replication open questions.

Runtime

Requires Python ≥ 3.9 and pymupdf (pip install -r requirements.txt from the plugin root).

Step 1 — Prepare the context bundle

python scripts/prepare_context.py papers/<arxiv-id>/paper.pdf

Writes papers/<arxiv-id>/.extract/context.json. Useful flags:

--per-section-cap 8000 (default) — caps each section's text so the bundle stays focused.
--total-cap 40000 (default) — soft total-char ceiling; shrinks biggest sections first if exceeded.
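The interaction between the two caps can be sketched as follows. This is a minimal illustration, not the logic inside prepare_context.py: the function name `apply_caps` and the exact trimming strategy (cut the longest section toward the next-longest until the total fits) are assumptions made for the example.

```python
def apply_caps(sections, per_section_cap=8000, total_cap=40000):
    """Truncate each section, then shrink the biggest ones until the total fits.

    `sections` maps section name -> text. Hypothetical helper for illustration.
    """
    # Step 1: hard cap on every section.
    capped = {name: text[:per_section_cap] for name, text in sections.items()}

    # Step 2: soft total ceiling, trimming the longest section first.
    excess = sum(len(text) for text in capped.values()) - total_cap
    while excess > 0:
        ordered = sorted(capped, key=lambda name: len(capped[name]), reverse=True)
        longest = ordered[0]
        second_len = len(capped[ordered[1]]) if len(ordered) > 1 else 0
        # Cut down to the next-longest section, at least 1 char, never more than needed.
        cut = min(excess, max(len(capped[longest]) - second_len, 1))
        capped[longest] = capped[longest][: len(capped[longest]) - cut]
        excess -= cut
    return capped
```

Trimming the longest section first means short sections such as the abstract survive intact, which is usually what you want when the results section dominates the paper.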

The bundle contains:

meta — from paper-search's meta.json (title, authors, abstract, categories, published).
outline — canonical section list with page ranges and char counts.
sections — per-section text (abstract, introduction, data, methodology, results, conclusion, references). This is the material you read.
candidate_formulas / candidate_core_formulas — every formula-looking line the heuristic caught, plus its guess at core formulas (labels cited in both Methodology and Results).
candidate_metrics — regex hits for eight common metrics with page + 160-char context window.
candidate_data_period — pattern-matched "from YYYY to YYYY, daily/monthly/..." if present.
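Before synthesizing, a quick overview of the bundle can help decide where to spend attention. The sketch below assumes `sections` is a mapping from section name to text and that the candidate fields are lists; the helper names `load_bundle` and `summarize_bundle` are hypothetical and not part of the bundled scripts.

```python
import json
from pathlib import Path

# Top-level keys listed in the bundle description above.
EXPECTED_KEYS = (
    "meta",
    "outline",
    "sections",
    "candidate_formulas",
    "candidate_core_formulas",
    "candidate_metrics",
    "candidate_data_period",
)


def load_bundle(path):
    """Read context.json from disk (path is supplied by the caller)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_bundle(bundle):
    """Report missing keys, per-section sizes, and candidate counts."""
    sections = bundle.get("sections") or {}
    return {
        "missing_keys": [key for key in EXPECTED_KEYS if key not in bundle],
        "section_chars": {name: len(text) for name, text in sections.items()},
        "n_candidate_metrics": len(bundle.get("candidate_metrics") or []),
        "n_candidate_core_formulas": len(bundle.get("candidate_core_formulas") or []),
    }
```

A missing `candidate_data_period` is not necessarily an error: the field is described as present only when the pattern matched.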

Large papers

If context.json comes back with total_section_chars > 30000 or many truncated sections, spawn an Explore sub-agent via the Task tool on the bundle and ask it to return the fields you need. This keeps the main context clean.
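The delegation rule can be written down as a small predicate. This is only a sketch under several assumptions: that `total_section_chars` is a top-level key of the bundle, that each `outline` entry carries a boolean `truncated` flag, and that "many" means three or more. None of these are confirmed by the skill text, so adjust them to the real bundle layout.

```python
def should_delegate(bundle, char_threshold=30000, truncated_threshold=3):
    """Return True when the bundle looks too large for the main context.

    Assumed layout (hypothetical): bundle["total_section_chars"] is an int and
    each bundle["outline"] entry is a dict that may carry "truncated": True.
    """
    total_chars = bundle.get("total_section_chars", 0)
    outline = bundle.get("outline") or []
    truncated = sum(1 for entry in outline if entry.get("truncated"))
    return total_chars > char_threshold or truncated >= truncated_threshold
```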

Step 2 — Synthesize (this is the Claude part)

Read papers/<arxiv-id>/.extract/context.json. Then author the two output files using the Write tool.

`metrics.json` — target schema

{
  "arxiv_id": "2403.12345",
  "reported_metrics": [
    {
      "name": "sharpe",
      "value": 1.42,
      "variant": "12m lookback, long-short",
      "unit": null,
      "page": 14,
      "context": "Table 3 reports an annualized Sharpe ratio of 1.42 for the 12-month lookback variant."
    }
  ],
  "core_formulas": [
    {
      "label": "Eq. 3",
      "section": "methodology",
      "page": 7,
      "text": "r_{i,t} = (1/L) * sum_{k=1..L} ret_{i,t-k}",
      "cited_in_results": true
    }
  ],
  "data_period": {"start": "1990-01", "end": "2020-12", "frequency": "daily"},
  "universe": ["US equities", "CRSP common shares", "NYSE/NASDAQ/AMEX"],
  "notes": "Authors specify returns are excess of 1-month T-bill; signal formed at month-end and held for 1 month."
}

Fields with null are allowed when the paper genuinely does not state the value — do NOT guess.
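For illustration, here is how an honest "the paper does not say" record might look when built in Python, where `None` serializes to JSON `null`. The record contents are invented for the example, and the `universe` and `notes` values are placeholders rather than output from a real paper.

```python
import json

# Illustrative record for a paper with no metric table and no stated sample period.
record = {
    "arxiv_id": "2403.12345",
    "reported_metrics": [],          # no metric table in the paper
    "core_formulas": [],
    "data_period": None,             # becomes JSON null; not guessed
    "universe": ["US equities"],
    "notes": "Paper does not state a sample period.",
}

serialized = json.dumps(record, indent=2)
print(serialized)
```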

How to fill each field

reported_metrics — Start from candidate_metrics, then:

Add values the regex missed. Tables are the big one — regex reads the narrative but not the table grids. Scan the results section text for numeric patterns near metric names.
Drop false positives (e.g. "Sharpe (1966)" matching the sharpe pattern).
Fill variant with the specific configuration (lookback length, long-only / long-short, top decile, market-neutral, etc.). The regex writes crude variants; you write precise ones.
Fix unit — "18%" should be {"value": 18.0, "unit": "%"}, not {"value": 0.18, "unit": null}. Keep values in the paper's original unit; paper-replicate's compare step normalizes.
Shorten context to one sentence that identifies where the number comes from.
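Two of the steps above, dropping citation false positives and keeping the paper's original unit, lend themselves to a small sketch. The helper names `is_citation_hit` and `parse_metric_value` are hypothetical, and the patterns are deliberately narrow; they are not the regexes used by extract_metrics.py.

```python
import re

# Matches a bare number with an optional percent sign, e.g. "18%" or "1.42".
VALUE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(%)?\s*$")


def is_citation_hit(context, keyword="Sharpe"):
    """True when the keyword appears as an author-year citation, e.g. 'Sharpe (1966)'."""
    pattern = rf"\b{re.escape(keyword)}\s*\(\s*(?:19|20)\d{{2}}[a-z]?\s*\)"
    return re.search(pattern, context) is not None


def parse_metric_value(raw):
    """Split a raw string into value and unit without rescaling percentages."""
    match = VALUE.match(raw)
    if match is None:
        return None
    return {"value": float(match.group(1)), "unit": "%" if match.group(2) else None}


print(is_citation_hit("as in Sharpe (1966), we"))  # citation, so drop this hit
print(parse_metric_value("18%"))                    # stays in percent, not 0.18
```

Keeping "18%" as 18.0 with unit "%" leaves normalization to paper-replicate's compare step, as the rule above requires.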

core_formulas — Start from candidate_core_formulas, then:

For each candidate, verify it is actually cited by the Results section. The heuristic sometimes misses citations; sometimes it flags a label that is not truly central.
Fix obvious math-font corruption (e.g. summation k=1..L should render as sum_{k=1..L}). Keep it ASCII-safe plain text — no LaTeX rendering attempts.
If a formula is clearly central but the heuristic missed it, add it. You have section text; the heuristic only had line-level regex.
Set cited_in_results: true only when the paper's Results section references the equation label.
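A mechanical first pass at the citation check might look like the sketch below. It assumes labels follow the "Eq. 3" style from the schema example and that the Results text cites them as "Eq. 3", "Eq. (3)", "Eqn. 3", or "Equation (3)"; the function name `cited_in_results` is hypothetical. A negative result should still be confirmed by reading the section, since papers also cite equations indirectly ("the signal defined above").

```python
import re


def cited_in_results(label, results_text):
    """Return True if the Results text references the equation number in `label`."""
    number = re.search(r"\d+(?:\.\d+)?", label)
    if number is None:
        return False
    pattern = re.compile(
        r"\b(?:eq\.?|eqn\.?|equation)\s*\(?\s*"
        + re.escape(number.group())
        + r"\s*\)?(?!\.?\d)",  # do not let "3" match inside "31" or "3.1"
        re.IGNORECASE,
    )
    return pattern.search(results_text) is not None
```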

data_period — Read the Data section text. Use the narrative (e.g. "our sample spans January 1990 through December 2020, sampled daily") in preference to the candidate_data_period regex output. Null if the paper really does not say.
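As a worked example, the narrative sentence quoted above maps onto the schema's `data_period` object roughly like this. The parser is a sketch that assumes English month names and the separators "through", "to", or "until"; the name `parse_data_period` is hypothetical, and anything it cannot parse returns `None` rather than a guess.

```python
import re

MONTHS = {
    name: index
    for index, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
SPAN = re.compile(
    r"(?P<m1>[A-Za-z]+)\s+(?P<y1>\d{4})\s+(?:through|to|until)\s+"
    r"(?P<m2>[A-Za-z]+)\s+(?P<y2>\d{4})",
    re.IGNORECASE,
)
FREQUENCY = re.compile(r"\b(daily|weekly|monthly|quarterly|annual)\b", re.IGNORECASE)


def parse_data_period(text):
    """Turn 'January 1990 through December 2020, sampled daily' into schema form."""
    span = SPAN.search(text)
    if span is None:
        return None
    start_month = MONTHS.get(span.group("m1").lower())
    end_month = MONTHS.get(span.group("m2").lower())
    if start_month is None or end_month is None:
        return None  # words before the years were not month names
    frequency = FREQUENCY.search(text)
    return {
        "start": f"{span.group('y1')}-{start_month:02d}",
        "end": f"{span.group('y2')}-{end_month:02d}",
        "frequency": frequency.group(1).lower() if frequency else None,
    }
```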

universe — Extract specific, named universes from the Data section. "US equities" alone is weak; "CRSP common shares on NYSE/NASDAQ/AMEX, excluding stocks below $5" is useful. A list of 1-4 strings.

notes — One to three sentences capturing replication-critical details the four fields above cannot express: signal-formation timing, rebalancing cadence, return definition (excess / total / log), treatment of delisted or missing names.

`note.md` — target shape

Follow references/note_template.md. The critical section is TL;DR: it must capture the paper's contribution, not its abstract. See the template for an explicit before/after example.

Step 3 — Validate

python scripts/validate_output.py papers/<arxiv-id>/metrics.json

Runs a stdlib-only schema check (types, required fields, reasonable ranges). It warns on empty arrays and unknown metric names; use --strict to fail on warnings. If it complains, fix metrics.json and re-run before handing off to paper-replicate.
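To give a feel for what such a check involves, here is a minimal stdlib-only sketch. It is not validate_output.py: the set of required fields, the metric-name spellings in `KNOWN_METRICS` (only "sharpe" appears in the schema example), and the function names are all assumptions, and range checks are left out.

```python
# Assumed required fields and types; the real validator may differ.
REQUIRED = {"arxiv_id": str, "reported_metrics": list, "core_formulas": list}
# Assumed spellings; only "sharpe" is confirmed by the schema example.
KNOWN_METRICS = {
    "sharpe", "max_drawdown", "annual_return", "volatility",
    "calmar", "information_ratio", "win_rate",
}


def check_metrics(doc):
    """Return (errors, warnings) for a parsed metrics.json document."""
    errors, warnings = [], []
    for key, expected in REQUIRED.items():
        if key not in doc:
            errors.append(f"missing required field: {key}")
        elif not isinstance(doc[key], expected):
            errors.append(f"{key} should be {expected.__name__}")
    for key in ("reported_metrics", "core_formulas"):
        if isinstance(doc.get(key), list) and not doc[key]:
            warnings.append(f"{key} is empty")
    metrics = doc.get("reported_metrics")
    for i, metric in enumerate(metrics if isinstance(metrics, list) else []):
        if not isinstance(metric, dict):
            errors.append(f"reported_metrics[{i}] should be an object")
            continue
        value = metric.get("value")
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"reported_metrics[{i}].value should be a number")
        if metric.get("name") not in KNOWN_METRICS:
            warnings.append(f"reported_metrics[{i}] has unknown name {metric.get('name')!r}")
    return errors, warnings


def passes(doc, strict=False):
    """Mirror the --strict idea: warnings fail the check only in strict mode."""
    errors, warnings = check_metrics(doc)
    return not errors and not (strict and warnings)
```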

Individual stage scripts (for debugging)

python scripts/extract_text.py <paper.pdf>       # raw text + heading detection
python scripts/extract_formulas.py <paper.pdf>   # formula candidates
python scripts/extract_metrics.py <paper.pdf>    # regex metric hits

Each prints JSON to stdout.

Non-goals

No silent LaTeX rendering. PyMuPDF math extraction is lossy; core_formulas[].text is a plain-ASCII approximation that a human (or the paper-replicate planner) cross-checks against the PDF.
No hallucinated fields. If the paper does not state a data period, data_period is null. If no metric table exists, reported_metrics is []. Do not invent anchors paper-replicate will later measure against.
No PDF re-download. paper-search owns that.

Hard rules

The TL;DR is not the abstract. A TL;DR is what you tell a colleague who asks "what's new?" in a hallway — the contribution, the mechanism, the headline result. See references/note_template.md for the contrast.
Regex is a recall hint, not an oracle. Every candidate metric must be verified against its source paragraph / table before it lands in metrics.json.
Core formulas are what the paper uses, not what looks mathy. A page of algebraic intermediate steps is not "core"; the one equation that defines the signal is.
