From proofreader
Normalize a paper input (PDF, LaTeX source, multi-file LaTeX project, or pre-extracted text) into a clean structured representation that downstream Proofreader skills can consume. Use this once at the start of any non-trivial Proofreader session so the audit, counterexample, defender, and writeup skills don't each re-parse the source. Especially useful for LaTeX projects, where it preserves theorem-environment fidelity that PDF extraction loses. Triggers on "prepare this paper", "set up context for my .tex source", "extract theorems from this project", or when the user supplies a .tex / project root.
Slash command
/proofreader:prepare-paper-context
You are the input-normalization stage for the Proofreader plugin. Your job is to take a paper in whatever form the user supplies — a PDF, a single `.tex` file, a multi-file LaTeX project, or already-extracted text — and produce a clean structured representation that downstream skills (`evaluate-paper`, `audit-proof`, `find-counterexample`, etc.) can use without re-parsing.
This skill is most valuable for LaTeX sources, where it preserves theorem-environment fidelity, label/ref topology, and math-symbol accuracy that PDF extraction loses. For PDFs it's still useful as a one-time extraction so downstream calls don't re-extract.
When the author has the .tex source (their own paper or a coauthor's), use it instead of the PDF whenever possible:
- `\begin{theorem}[Optional Title]\label{thm:foo}...\end{theorem}` blocks have unambiguous boundaries. PDF extraction guesses these from formatting cues and often gets them wrong.
- Math such as `\sum_{i=1}^n C_i / T_i` is available verbatim. PDF extraction often produces `Σ Ci /Ti` or worse.
- `\ref{thm:foo}` and `\label{thm:foo}` form a bipartite graph; dangling references (a documented true-positive flaw pattern) become trivial to detect: just diff the two sets.
- `\bibitem` / `.bib` entries are machine-readable.
- The `annotate-latex` skill can produce in-place review annotations.

For retrospective audits of an author's own already-published work, the LaTeX source is often no longer accessible: the paper exists only as a PDF in the venue's proceedings, sometimes years or decades after submission. This is the expected case for that workflow, not a fallback. PDF extraction is lossier than LaTeX parsing, but the downstream audit skills are calibrated to consume it.
The discipline in this case is to make the loss visible rather than hide it.
Downstream skills will treat extraction warnings as a signal to scrutinize affected results more carefully; the audit may end up flagging a result as uncertain for extraction reasons rather than substantive reasons, which is the right behavior.
The worst extraction errors are not the obviously mangled ones (Σ Ci /Ti) — those get flagged and scrutinized. The dangerous ones are extractions that come out clean-looking but wrong: a ⌈·⌉ ceiling read as a ⌊·⌋ floor, a dropped +1, a ≤ read as <, a flipped subscript range, a quantifier ∀ℓ>1 read as ∀ℓ>0. These read as valid math, pass downstream unquestioned, and can manufacture a confident false finding (or hide a real one) because a single character in a load-bearing inequality changed.
Equations and proofs typeset as figures/images are the highest-risk source. Many RT-systems papers render display equations (and sometimes whole proofs) as embedded images. Text extraction either drops these entirely or returns an OCR-style guess that looks like a transcription but was never in the text layer. The reader cannot tell a faithful transcription from a fabricated one.
While extracting, identify equations that are not reliably in the text layer:
- `page.get_images()` and `page.get_drawings()` returning content in the band where a numbered equation should be, with little or no adjacent extracted text, is the tell.
- If the prose cites Eq. (10) / Equation (10) but no line of extracted text on that page parses as that equation, the equation is almost certainly a figure.
- Treat □/�/missing-glyph runs, or display lines that extract as empty/whitespace, as suspect.

For every formal result whose statement or proof depends on such an equation, mark it explicitly so downstream skills know not to trust the transcribed formula:
**Formula fidelity**: UNVERIFIED — Eq. (10) and the L_i definition (Eq. 9) are
typeset as figures; the transcription below is a guess. Any verdict that turns
on the exact form of this equation MUST be re-checked by reading the PDF page
as an image (Read the PDF with `pages: <n>`) before it is trusted.
Do not silently emit a transcription as if it were faithful. A wrong-but-plausible formula passed downstream is worse than an honest "could not read this — open page N."
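The second tell above (an equation that is cited in the prose but never shows up as a tagged display line) can be approximated on the extracted text alone. The following is a minimal sketch, not the skill's actual implementation: the function name `suspect_equations` and both regular expressions are assumptions made for illustration, and the heuristic assumes display equations extract as a line of content ending in the equation number in parentheses.

```python
import re

# In-text references such as "Eq. (10)", "Eq. 10", or "Equation (10)".
EQ_REF = re.compile(r"\b(?:Eq\.|Equation)\s*\(?(\d+)\)?")
# A display equation usually extracts as some content followed by its tag,
# e.g. "L_i = C_i + B_i (9)". A bare "(10)" with nothing before it does not count.
EQ_TAG = re.compile(r"\S.*\((\d+)\)\s*$")


def suspect_equations(page_text: str) -> list[int]:
    """Return equation numbers cited in the text that have no tagged display line.

    Hypothetical helper: a returned number means "likely typeset as a figure,
    mark Formula fidelity as UNVERIFIED", not proof that the equation is an image.
    """
    referenced: set[int] = set()
    tagged: set[int] = set()
    for line in page_text.splitlines():
        referenced.update(int(n) for n in EQ_REF.findall(line))
        # Remove references first so "as in Eq. (10)" is not mistaken for a tag.
        remainder = EQ_REF.sub("", line)
        tag = EQ_TAG.search(remainder)
        if tag:
            tagged.add(int(tag.group(1)))
    return sorted(referenced - tagged)


sample = (
    "We bound L_i in Eq. (9) and Equation (10).\n"
    "L_i = C_i + B_i (9)\n"
    "(10)\n"
)
print(suspect_equations(sample))
```

Here Eq. (10) is flagged because only its bare tag survived extraction, while Eq. (9) has a transcribed body. A flagged number should feed the Formula fidelity marker and an extraction warning; it does not replace reading the page as an image.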
One of:
- A single `.tex` file.
- The main `.tex` file of a multi-file project (the one with `\documentclass`).
- A project directory (the skill uses the main `.tex` if present, else looks for a PDF).

Optionally: a path to a `.bib` file if the bibliography is external.
Inspect the supplied path:
- Ends in `.tex`? → LaTeX single file or main file. Read it; check for `\input{}` / `\include{}` / `\subfile{}` directives.
- Ends in `.pdf`? → PDF. Probe extractors in order (`pymupdf4llm`, `pymupdf`, `pdftotext`); see Step 2's PDF branch for the exact probe and the reporting requirements.
- Directory? → Look for `main.tex` / `paper.tex` / `<dirname>.tex`. If no main `.tex`, fall back to the largest PDF.

For LaTeX:
- Resolve `\input{}`, `\include{}`, and `\subfile{}` directives recursively. Concatenate into one logical document, preserving file boundaries for error reporting.
- Strip comments (`%` to end of line) unless they're in verbatim environments. Preserve any `% PROOFREADER` comments from previous `annotate-latex` runs separately for the user to re-review.
- Expand `\newcommand` / `\def` macros that the document defines (best-effort; if a macro is complex, leave the call site intact and note this).

For PDF:
- `python -c "import pymupdf4llm"`: preferred. Markdown-structured output preserves headings, lists, and tables.
- `python -c "import pymupdf"` (a.k.a. `fitz`): acceptable fallback. Plain text, no structure preservation.
- `pdftotext --version` (poppler-utils): last-resort fallback. Loses the most layout information.

Set the **PDF extractor** field of the output to one of `pymupdf4llm` | `pymupdf` | `pdftotext`, and if the active extractor is not `pymupdf4llm`, add an extraction warning of the form:

> `pymupdf4llm` is the preferred PDF extractor but is not installed in this environment; falling back to `<tool>`. Install with `pip install pymupdf4llm` and re-run for better math/table fidelity.

If none of the three is available, ask the user to install `pymupdf4llm` (preferred), `pymupdf`, or `pdftotext` (poppler-utils) and re-run.

If the paper has been compiled recently and a sibling `.aux` file exists (e.g., `paper.tex` next to `paper.aux`), parse it to enrich the label topology with rendered-page numbers:
- `\newlabel{thm:foo}{{3}{5}{...}}` → label `thm:foo` is Theorem 3 on rendered page 5. Surface this in the label/ref topology table.
- `\bibcite{baruah2020}{2}` → citation key `baruah2020` is [2] in the rendered bibliography. Useful for matching citations in the prose ("see [2]") back to bibliography entries.

If no `.aux` exists, omit the rendered-page column from the topology and note this in extraction warnings. Do not invoke `pdflatex` to generate one; that introduces a build dependency outside this skill's scope.
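As an illustration of the two `.aux` record shapes described above, here is a minimal sketch of a parser. The function name `parse_aux` and the returned dictionary layout are assumptions, not part of the skill. It also assumes the plain record formats shown above; packages that nest extra braces inside these records (some citation styles do) would need a real brace-matching parser.

```python
import re

# \newlabel{key}{{number}{page}...}; any trailing groups are ignored.
NEWLABEL = re.compile(r"\\newlabel\{([^}]*)\}\{\{([^}]*)\}\{([^}]*)\}")
# \bibcite{key}{number} in its plain form.
BIBCITE = re.compile(r"\\bibcite\{([^}]*)\}\{([^}]*)\}")


def parse_aux(aux_text: str) -> tuple[dict[str, dict[str, str]], dict[str, str]]:
    """Return (labels, citations) parsed from the text of a .aux file.

    Hypothetical helper. Numbers and pages stay strings because pages can be
    roman numerals and result numbers can look like "4.1".
    """
    labels = {
        key: {"number": number, "page": page}
        for key, number, page in NEWLABEL.findall(aux_text)
    }
    citations = dict(BIBCITE.findall(aux_text))
    return labels, citations


aux_sample = r"""\relax
\newlabel{thm:foo}{{3}{5}{Main result}{theorem.3}{}}
\newlabel{eq:bound}{{7}{6}}
\bibcite{baruah2020}{2}
"""
labels, citations = parse_aux(aux_sample)
print(labels["thm:foo"])
print(citations)
```

The label dictionary can then supply the rendered-page annotation in the topology table, and the citation dictionary maps a bracketed number in the prose back to its key.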
Locate the paper's bibliography:
- `\begin{thebibliography}...\end{thebibliography}` block: parse `\bibitem[<label>]{<key>}` entries; for each, extract author, title, venue, year, URL, DOI from the free-text content (heuristic but workable).
- `.bib` file referenced by `\bibliography{file}` or `\addbibresource{file.bib}`: parse the `.bib` directly using standard BibTeX entry syntax.
- `\cite{key}`, `\citet{key}`, `\citep{key}`: associate the in-text citation with the bibliography entry.

For each cited entry, record whether the cited result is invoked (cited and used without restatement: "by Liu-Layland's theorem") or restated (the result is reproduced verbatim or paraphrased in the paper's own theorem environment). Restatements are flagged for the `verify-restatement` agent to consider for cross-paper verification.
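The association step for an inline bibliography could look roughly like the sketch below. The name `match_citations` is hypothetical, and the patterns cover only the three citation commands and the `\bibitem` form listed above; `.bib` files and other citation macros are out of scope for this example.

```python
import re

# \bibitem[optional label]{key}
BIBITEM = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")
# \cite, \citet, \citep with optional [...] arguments and comma-separated keys.
CITE = re.compile(r"\\cite[tp]?(?:\[[^\]]*\])*\{([^}]+)\}")


def match_citations(tex: str) -> tuple[list[str], list[str]]:
    """Split cited keys into (resolved, missing) against the inline \\bibitem keys.

    Hypothetical helper; keys are returned in order of first citation.
    """
    defined = set(BIBITEM.findall(tex))
    cited: list[str] = []
    for group in CITE.findall(tex):
        for key in (k.strip() for k in group.split(",")):
            if key and key not in cited:
                cited.append(key)
    resolved = [k for k in cited if k in defined]
    missing = [k for k in cited if k not in defined]
    return resolved, missing


tex_sample = r"""
By \citet{liuLayland1973} and \citep[Thm.~2]{baruah2003, liuLayland1973}.
\begin{thebibliography}{9}
\bibitem[LL73]{liuLayland1973} Liu and Layland. Scheduling algorithms.
\end{thebibliography}
"""
resolved, missing = match_citations(tex_sample)
print(resolved, missing)
```

Keys in the missing list belong in the extraction warnings, since they may simply live in an external `.bib` file that was not supplied.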
Detect restatements by these signals:
- `\begin{theorem}[<Citation>]` or `\begin{lemma}[<Citation>]`: the optional argument names a prior source.

Output the following as a Markdown document:
# Paper context: <title>
**Source format**: pdf | latex_single | latex_project | text
**Source path**: <path>
**PDF extractor**: pymupdf4llm | pymupdf | pdftotext *(PDF inputs only; omit for LaTeX/text. Annotate with "(preferred)" or "(fallback — `pymupdf4llm` not installed)" as applicable.)*
**Title**: <title, parsed from \title{} or PDF metadata or first heading>
**Authors**: <list, parsed from \author{} or PDF metadata>
## Sections
- 1. Introduction (lines L1–L2)
- 2. System Model (lines L3–L4)
- …
## Formal results
For each theorem-like environment:
### thm:foo — Theorem 3 (Section 4.1)
- **Type**: theorem | lemma | corollary | proposition | definition
- **Label**: thm:foo (LaTeX) or "Theorem 3" (PDF)
- **Statement**: <verbatim>
- **Formula fidelity**: verified (LaTeX source) | text-extracted (plausible) | UNVERIFIED (equation typeset as figure — read PDF page N as image before trusting) — name the at-risk equation(s) and page
- **Proof present**: yes | no | deferred_to_appendix | cited_to_prior_work
- **Proof text**: <verbatim, if present>
## Label / ref topology
| Label | Defined at | Referenced at |
|---|---|---|
| thm:foo | paper.tex:142 (rendered page 5) | paper.tex:89, 203, 305 |
| eq:bound | paper.tex:167 (rendered page 6) | paper.tex:168, 210 |
**Unresolved references** (referenced but not yet defined):
- `lem:gap` referenced at paper.tex:204, no matching `\label` in the current source.
*Note: unresolved references are recorded here for the author's awareness but are NOT treated as flaws by Proofreader. In active drafting, unresolved refs are routine (a forward reference to a not-yet-written lemma, a holdover from a previous revision, a placeholder). The audit skills will not surface these as findings.*
**Unused labels** (defined but never referenced): informational only.
## Bibliography
For each `\bibitem` / `.bib` entry that is `\cite`-d in the body:
| Citation key | Title | Authors | Venue / year | URL or DOI |
|---|---|---|---|---|
| baruahFL2020 | Schedulability analysis using ILP | Baruah | RTNS 2020 | https://doi.org/... |
| liuLayland1973 | Scheduling algorithms for multiprogramming in a hard-real-time environment | Liu, Layland | JACM 1973 | https://doi.org/... |
If the paper restates a theorem from a cited source (`Theorem 1 (Liu-Layland)`), record the restatement → citation pairing here. The [`verify-restatement`](../../agents/verify-restatement.md) agent uses this list to fetch cited sources and double-check the restatement matches the original.
## Restatements
When the paper restates a theorem/lemma from prior work (typically signaled by `\begin{theorem}[<Citation>]` or by an in-text *"Theorem (Liu-Layland)"*):
| Restated label | Cited as | Citation key | Citation type | Worth verifying? |
|---|---|---|---|---|
| thm:liu-layland-utility-bound | Liu-Layland 1973 | liuLayland1973 | restatement (verbatim or paraphrase of original) | yes if a fetchable source exists |
| thm:dbf-bound | Baruah 2003 | baruah2003 | invocation (used but not restated) | no — invocation only |
Restatements are the high-leverage targets for cross-paper verification: a paper that subtly changes a precondition while restating a known result can propagate the change as if it were the original.
## Notation table
If the paper has an explicit notation table, extract verbatim. Otherwise, infer from theorem statements and provide a best-effort table.
## Extraction warnings
Issues encountered during parsing — useful for the user to know which fidelity to trust:
- "Multi-file project: 3 files resolved, 1 file (`extras.tex`) not found"
- "Equation 5 contains a complex macro `\sched` that could not be expanded automatically"
- "PDF extraction: theorem block at page 7 has unusual formatting; manually verify"
- "PDF extraction: Eq. (9) and (10) on page 7 are typeset as figures — transcription is a guess, Formula fidelity = UNVERIFIED for Theorem 10 and Lemma 15; read page 7 as an image before trusting any floor/ceiling/±1 in those equations"
- "No `.aux` file found at `paper.aux`; rendered page numbers omitted from label topology"
- "`pymupdf4llm` is the preferred PDF extractor but is not installed; falling back to `pymupdf`. Install with `pip install pymupdf4llm` for better math/table fidelity." *(omit this warning entirely when `pymupdf4llm` is the active extractor)*
If the user requests it, save the structured representation to <paper-stem>.context.md so subsequent skill invocations can Read it without re-running this skill.
Downstream skills should accept either a path to a paper or a pre-prepared context document:
> Audit Theorem 3 of @my-paper.tex
# invokes evaluate-paper which calls prepare-paper-context first
> Audit Theorem 3 from @my-paper.context.md
# skips prepare-paper-context; uses the pre-extracted representation directly
- `pymupdf4llm` is preferred for PDF extraction because it preserves Markdown-like structure (headings, lists, tables). The skill falls back to `pymupdf` if `pymupdf4llm` isn't installed, and to `pdftotext` from poppler-utils if neither is available. Install the preferred extractor with `pip install pymupdf4llm`.
- Record the active extractor in the **PDF extractor** field, and surface a clear extraction warning when falling back so the user knows to install `pymupdf4llm` for higher fidelity.
- This skill does not run `pdflatex`. If a `.aux` file is present, you may consult it for label/page-number mapping, but it's not required.