pdf-to-text
Extract clean text and markdown from any PDF. Fixes broken Unicode mappings that make Chrome's copy-paste produce gibberish. Returns plain text, basic markdown, or structured markdown with TOC, headings, page markers, and token counts. Works locally via WASM — your PDFs never leave your machine.
Slash command: /pdf-to-text:extract-pdf
```bash
# Locate the plugin root, then print any pending update notice.
_PLUGIN_DIR="${CLAUDE_PLUGIN_ROOT:-$(cd "$(dirname "$0")/.." 2>/dev/null && pwd || echo "$HOME/.config/pdf-to-text")}"
_UPD=$("$_PLUGIN_DIR/bin/update-check" 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
```
If UPGRADE_AVAILABLE <old> <new> is output: tell the user a new version is available and ask if they want to upgrade. If yes, run $_PLUGIN_DIR/hooks/install-engine.sh. If JUST_UPGRADED <old> <new>: tell the user "PDF to Text engine updated to v{new}!" and continue.
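The two update-check outcomes described above could be handled along these lines. This is only a sketch: the helper name and return shape are hypothetical, and the status words and `<old> <new>` argument order are the only parts taken from the text.

```javascript
// Hypothetical helper: classify one line printed by bin/update-check.
// Only the status words and the "<old> <new>" order come from the text above;
// the function name and the returned object shape are illustrative.
function parseUpdateLine(line) {
  const [status, oldVersion, newVersion] = line.trim().split(/\s+/);
  if (status === "UPGRADE_AVAILABLE" && oldVersion && newVersion) {
    // Caller should ask the user, then run hooks/install-engine.sh on "yes".
    return { action: "ask-to-upgrade", oldVersion, newVersion };
  }
  if (status === "JUST_UPGRADED" && oldVersion && newVersion) {
    return {
      action: "announce",
      message: `PDF to Text engine updated to v${newVersion}!`,
    };
  }
  // Empty or unrecognized output: nothing to report, carry on.
  return { action: "none" };
}
```

Empty output maps to `action: "none"`, which matches the shell snippet's behaviour of printing nothing when no update is pending.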
A local PDF-to-markdown extraction engine with a 7-level
fallback cascade that recovers text from PDFs whose embedded Unicode
mappings are broken or missing. It outputs three markdown formats:
plain, basic (one # Page N per page), and structured
(YAML frontmatter + TOC + font-detected headings + fenced code blocks).
Use this skill when you need clean text from a PDF, not when you need the visual layout. For tables, figures, and images, look elsewhere.
Use glyph-api when:
- The user provides a .pdf and wants the content
- cat/pdftotext produced garbage on a particular PDF

Do NOT use glyph-api for:
- Table extraction (use camelot, tabula-py, or pdfplumber).
- Scanned PDFs with no text layer. Run tesseract or a cloud OCR service first, then feed the resulting text-layered PDF to glyph-api.

Four invocation paths, ordered by agent-friendliness.
This plugin registers MCP tools automatically. Use them directly:
- extract_pdf: pass url or path, get plain/basic/structured markdown + stats
- render_markdown: fetch and parse any .md URL, get sections + token count
- list_recent: see recently extracted PDFs from the local cache

Use the extract_pdf tool with path: "/path/to/document.pdf" and format: "structured"
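Most MCP clients build the request for you, but for anyone wiring the call by hand, a request for `extract_pdf` might look roughly like the sketch below. The `tools/call` envelope follows the general MCP convention. The helper name, the id counter, and the choice of "structured" as the default format are assumptions, not something this plugin documents.

```javascript
// Build a JSON-RPC 2.0 "tools/call" request for the extract_pdf tool.
// Argument names (path, url, format) follow the tool description above;
// the helper name, id counter, and default format are illustrative.
let nextRequestId = 1;

function buildExtractRequest({ path, url, format = "structured" }) {
  if (!path && !url) throw new Error("pass either path or url");
  const args = { format };
  if (path) args.path = path;
  if (url) args.url = url;
  return {
    jsonrpc: "2.0",
    id: nextRequestId++,
    method: "tools/call",
    params: { name: "extract_pdf", arguments: args },
  };
}

// Example: request structured markdown for a local file.
const exampleRequest = buildExtractRequest({ path: "/path/to/document.pdf" });
```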
If you're driving a browser (Claude in Chrome, Playwright, Puppeteer, CDP):
```js
// Navigate to any .pdf URL. The extension's DNR rule intercepts and
// redirects to the viewer. The viewer runs the WASM extractor and
// publishes the result to a page-world global.
await navigate("https://example.com/document.pdf");

// Wait for extraction to finish.
await waitFor(() => window.__glyph?.status === "ready");

// Read the three markdown formats (all pre-computed).
const plain = window.__glyph.markdown.plain;
const basic = window.__glyph.markdown.basic;
const structured = window.__glyph.markdown.structured;
```
The viewer also fires a glyph:status CustomEvent and sets
document.body.dataset.glyphStatus to loading / ready / error, so
agents that prefer selector-based waits can use:
```js
await waitForSelector('body[data-glyph-status="ready"]');
```
Iframe-embedding pages receive a postMessage with {type: "glyph:status", status, markdown} once ready.
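On the embedding side, the postMessage described above could be consumed with a small guard like the following sketch. The message fields (`type`, `status`, `markdown`) come from the text. The function name and the optional origin check are assumptions, and `markdown` is assumed to carry the same `plain` / `basic` / `structured` keys as `window.__glyph.markdown`.

```javascript
// Returns the markdown payload when `event` is a ready glyph:status message,
// otherwise null. `expectedOrigin` is an optional, illustrative safety check.
function readGlyphMessage(event, expectedOrigin) {
  if (expectedOrigin && event.origin !== expectedOrigin) return null;
  const data = event.data;
  if (!data || data.type !== "glyph:status") return null;
  if (data.status !== "ready") return null;
  return data.markdown ?? null;
}

// Browser-only wiring; skipped when run outside a page (e.g. in Node).
if (typeof window !== "undefined") {
  window.addEventListener("message", (event) => {
    const markdown = readGlyphMessage(event);
    // Assumes the payload mirrors window.__glyph.markdown.
    if (markdown) console.log(markdown.structured);
  });
}
```

Keeping the check in a pure function makes it easy to test without a browser, and messages with `loading` or `error` status are simply ignored.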
Requirements: the Glyph Chrome extension must be installed (unpacked
or from the Chrome Web Store). The extension ID is stable for a given
store listing; print it from any viewer tab via chrome.runtime.id in
the DevTools console.
```bash
cd /Users/jordan/Code/glyph-api
./target/release/glyph-api path/to/document.pdf
```
Outputs plain text to stdout. One form-feed (\f) between pages. No
frontmatter, no headings, no markdown syntax — just the extracted text.
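Because pages are separated by a single form-feed, the CLI output can be split back into pages with one call. A minimal sketch, assuming the stdout has already been captured as a string (the function name is illustrative):

```javascript
// Split captured CLI stdout into per-page records using the form-feed
// separator described above. Page numbers are 1-based.
function splitPages(stdout) {
  return stdout
    .split("\f")
    .map((text, index) => ({ page: index + 1, text: text.trim() }));
}
```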
To build the binary from source (~5s):
```bash
cd /Users/jordan/Code/glyph-api
cargo build --release --bin glyph-api
```
```js
import initWasm, { extract_chars_with_positions } from "./pkg/glyph_api.js";
import { readFileSync } from "node:fs";

await initWasm({ module_or_path: readFileSync("./pkg/glyph_api_bg.wasm") });

const pdfBytes = readFileSync("document.pdf");
const json = extract_chars_with_positions(pdfBytes);
const parsed = JSON.parse(json);

// parsed.pages[i].text — per-page extracted text
// parsed.pages[i].chars — per-char position metadata (for spatial queries)
// parsed.total_chars, total_resolved, total_unresolved
```
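One way to use the totals is as a quick quality signal for how much of the document was recovered. The sketch below assumes `total_chars` and `total_resolved` are numeric fields at the top level of `parsed`, as the comments above suggest. The function name and return shape are hypothetical.

```javascript
// Summarize a parsed extraction result. Field names come from the comments
// above; treating total_resolved / total_chars as a quality ratio is an
// assumption about how those counters are meant to be read.
function summarizeExtraction(parsed) {
  const total = parsed.total_chars;
  const resolvedRatio = total > 0 ? parsed.total_resolved / total : 0;
  return {
    pageCount: parsed.pages.length,
    resolvedRatio,
    // Join per-page text with a blank line between pages.
    text: parsed.pages.map((page) => page.text).join("\n\n"),
  };
}
```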
To also get the three markdown formats, import the markdown module:
```js
import {
  toPlain,
  toBasicMarkdown,
  toStructuredMarkdown,
} from "./extension/dist/markdown.js";

const structured = toStructuredMarkdown(parsed, {
  srcUrl: "https://example.com/document.pdf",
});
```
```
<page 1 text>

<page 2 text>

...
```
No markdown. Blank line between pages. Use for embeddings and keyword search.
```markdown
# Page 1

<escaped page 1 text>

---

# Page 2

<escaped page 2 text>
```
Page headers, minimal escaping. Use when you want page-boundary awareness without heading detection.
```markdown
---
source: https://example.com/document.pdf
pages: 9
chars: 21146
tokens: ~5.3k
headings: 13
extracted_at: 2026-04-11T17:00:00.000Z
extractor: glyph-api
---

## Table of Contents

- Bitcoin: A Peer-to-Peer Electronic Cash System _(p.1)_
- 1. Introduction _(p.1)_
- 2. Transactions _(p.2)_
...

<!-- page 1 -->

## Bitcoin: A Peer-to-Peer Electronic Cash System

Satoshi Nakamoto
[email protected]
www.bitcoin.org

## Abstract

A purely peer-to-peer version of electronic cash...

## 1. Introduction

Commerce on the Internet has come to rely...
```
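The `<!-- page N -->` comments in the structured sample above make per-page chunking straightforward. A minimal sketch, assuming the markers appear exactly in that form. The function name is illustrative, and anything before the first marker (frontmatter and TOC) is deliberately dropped.

```javascript
// Split structured markdown into per-page chunks using the page markers.
// String.prototype.split keeps the captured page number in the result, so
// the array looks like [preamble, "1", page1Text, "2", page2Text, ...].
function chunkByPage(structured) {
  const parts = structured.split(/<!--\s*page\s+(\d+)\s*-->/);
  const chunks = [];
  for (let i = 1; i < parts.length; i += 2) {
    chunks.push({ page: Number(parts[i]), text: parts[i + 1].trim() });
  }
  return chunks;
}
```

Each chunk keeps its page number, so a RAG pipeline can cite the source page alongside the retrieved text.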
Notable features:
- <!-- page N --> markers: invisible when rendered, queryable for per-page chunking in RAG pipelines
- Heading detection, including numbered-section patterns (^\d+\.\s+[A-Z])
- Fenced code blocks for C (#include), Python (def), math formulas (∑ ⋅ ≤), and diagram labels, which keeps GitHub's markdown renderer from choking on them

Known limitations:
- dense_20p_70l_r2.pdf contains the same sentence repeated 1400 times by design. That's the input, not an extraction bug.
- fi ligature bug: papers using ligatures (fi, fl, ffi, ffl) may emit raw glyph codes like 002gures instead of figures. Future work: extend the L3/L4 cascade to recognize ligature glyph names.
- … Tj operator produce Form10 4 0 2025 instead of Form 1040 (2025). Future work: detect tight-spacing runs and apply word grouping.
- 110TH CONGRESS set with positive letter-tracking extracts as 110 TH C ONGRESS. Future work: detect heading-level letter-spacing and collapse.
- Math is fenced as text for legibility but won't render via MathJax. Future work: $$...$$ wrapping for paragraphs where symbol density is high enough.

If you're unsure whether glyph-api improved on Chrome's native extraction for a given PDF, look at the frontmatter's headings: count and the structured output's TOC. Both should be non-empty for any real document. For the famous "Attention Is All You Need" paper (arXiv 1706.03762), the structured output has:
- … docmap --type code --lang c on the appendix

If the output has headings: 0 and no TOC, the PDF likely has unusual content structure. Try feeding it back and reporting the file shape.
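The headings-and-TOC check can also be automated. The sketch below assumes the frontmatter and TOC look like the structured sample shown earlier (a `headings:` key between `---` fences, and a `## Table of Contents` heading followed by `- ` items). The function name and return shape are hypothetical.

```javascript
// Sanity-check structured output: a non-zero headings count in the
// frontmatter and a non-empty TOC. Layout assumptions follow the sample
// output; CRLF line endings are not handled in this sketch.
function checkStructure(structured) {
  const frontmatter = structured.match(/^---\n([\s\S]*?)\n---/);
  const headingsLine =
    frontmatter && frontmatter[1].match(/^headings:\s*(\d+)\s*$/m);
  const headings = headingsLine ? Number(headingsLine[1]) : 0;
  const hasToc = /^## Table of Contents\s*\n\s*- /m.test(structured);
  return { headings, hasToc, ok: headings > 0 && hasToc };
}
```

A result with `ok: false` corresponds to the "headings: 0 and no TOC" case described above.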
If you have docmap (v0.4.0+) installed, you can verify the output's
structural integrity:
```bash
docmap extracted.structured.md                        # section tree
docmap extracted.structured.md --type code --lang c   # find C blocks
docmap extracted.structured.md --type math            # find math blocks
docmap extracted.structured.md --json | jq '.documents[0].sections'
```
This gives independent confirmation that the extracted markdown is semantically navigable — agents can jump to sections, count constructs, and reason about document structure.
npx claudepluginhub jordancoin/pdf-to-text --plugin pdf-to-text