Skill

pdf-to-text

Use when extracting text from a PDF file, especially when the built-in PDF reader is insufficient or the PDF is too large. Triggers on "read this PDF", "extract text from PDF", "what does this PDF say", scanned documents, image-heavy PDFs.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/text-utils:pdf-to-text

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Extract text from PDF files. Try the fast path first, fall back to OCR for scanned documents.

SKILL.md

88 lines · ~590 tokens

Stats

Language: CSS

Parent stars: 1

Maintenance: Excellent

Last commit: Apr 4, 2026

PDF to Text

Extract text from PDF files. Try the fast path first, fall back to OCR for scanned documents.

Strategy

1. pdftotext (fast, text-based PDFs)

pdftotext "$INPUT" -

For layout-sensitive documents (tables, columns):

pdftotext -layout "$INPUT" -

2. OCR fallback (scanned documents)

If pdftotext returns empty or garbled output, the PDF most likely consists of scanned page images. Use ocrmypdf to add a text layer, then extract:

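# Use a temp directory: a suffix after the X's in a mktemp template is not portable to BSD/macOS.
OCR_DIR=$(mktemp -d /tmp/ocr-XXXXXX)
# --skip-text leaves pages that already have a text layer untouched.
ocrmypdf --skip-text "$INPUT" "$OCR_DIR/ocr.pdf"
pdftotext "$OCR_DIR/ocr.pdf" -
rm -r "$OCR_DIR"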

Or OCR individual pages directly:

PAGE_DIR=$(mktemp -d /tmp/pages-XXXXXX)
pdftoppm "$INPUT" "$PAGE_DIR/page" -png
for f in "$PAGE_DIR"/page-*.png; do tesseract "$f" stdout; done
rm -r "$PAGE_DIR"
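The per-page loop above can be wrapped in a small function. This is a minimal sketch, not part of the original skill: the function name ocr_pages, the 300 DPI default, and the eng language default are assumptions chosen for illustration (pdftoppm renders at 150 DPI unless -r is given, and higher resolution often helps OCR).

```shell
# ocr_pages FILE [DPI] [LANG]: rasterize each page, OCR it, print text to stdout.
# Hypothetical helper; DPI and LANG defaults are assumptions, adjust as needed.
ocr_pages() {
  local input=$1 dpi=${2:-300} lang=${3:-eng} dir f
  dir=$(mktemp -d /tmp/pages-XXXXXX) || return 1
  # -r sets the rendering resolution; output files are named page-N.png
  pdftoppm -r "$dpi" -png "$input" "$dir/page"
  for f in "$dir"/page-*.png; do
    [ -e "$f" ] || continue          # glob matched nothing
    tesseract "$f" stdout -l "$lang" # "stdout" sends the text to standard output
  done
  rm -r "$dir"
}
```

Usage would look like ocr_pages "$INPUT" 300 eng > out.txt. The language must be one that is installed for tesseract.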

3. Specific pages

Extract a page range before processing:

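# A temp directory keeps the mktemp template portable (trailing X's only).
SUBSET_DIR=$(mktemp -d /tmp/subset-XXXXXX)
# "." refers to the input file; 5-10 is an inclusive, 1-based page range.
qpdf "$INPUT" --pages . 5-10 -- "$SUBSET_DIR/subset.pdf"
pdftotext "$SUBSET_DIR/subset.pdf" -
rm -r "$SUBSET_DIR"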
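When only the text of a page range is needed, pdftotext can select pages itself with -f (first page) and -l (last page), which avoids the intermediate file. The sketch below is an illustration rather than part of the original skill; the helper names is_page_number and extract_pages are hypothetical.

```shell
# is_page_number N: succeeds when N is a positive integer without leading zeros.
is_page_number() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# extract_pages FILE FIRST LAST: print text of pages FIRST..LAST (1-based, inclusive).
extract_pages() {
  local input=$1 first=$2 last=$3
  if ! is_page_number "$first" || ! is_page_number "$last" || [ "$first" -gt "$last" ]; then
    echo "usage: extract_pages FILE FIRST LAST (1-based, FIRST <= LAST)" >&2
    return 2
  fi
  pdftotext -f "$first" -l "$last" "$input" -
}
```

For example, extract_pages "$INPUT" 5 10 covers the same pages as the qpdf command above. The qpdf route remains useful when the subset must be passed to another tool, such as ocrmypdf.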

How to tell if OCR is needed

pdftotext "$INPUT" - | wc -w

If the word count is near zero for a multi-page document, the PDF is almost certainly scanned and needs OCR.
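The check and the fallback can be combined into one function. This is a sketch under stated assumptions, not the skill's own implementation: the names needs_ocr and extract_text are hypothetical, and the threshold of 5 words per page is an arbitrary illustrative value. pdfinfo ships with poppler alongside pdftotext.

```shell
# Assumed threshold: an average below this many words per page counts as "scanned".
MIN_WORDS_PER_PAGE=${MIN_WORDS_PER_PAGE:-5}

# needs_ocr WORDS PAGES: succeeds (exit 0) when the average is below the threshold.
needs_ocr() {
  local words=$1 pages=$2
  [ "$pages" -gt 0 ] 2>/dev/null || return 0   # unknown page count: assume OCR
  [ $(( words / pages )) -lt "$MIN_WORDS_PER_PAGE" ]
}

# extract_text FILE: fast path first, OCR only when the text layer looks empty.
extract_text() {
  local input=$1 words pages tmp
  words=$(pdftotext "$input" - | wc -w | tr -d ' ')
  pages=$(pdfinfo "$input" | awk '/^Pages:/ {print $2}')
  if needs_ocr "$words" "$pages"; then
    tmp=$(mktemp -d /tmp/ocr-XXXXXX) || return 1
    ocrmypdf --skip-text "$input" "$tmp/ocr.pdf" && pdftotext "$tmp/ocr.pdf" -
    rm -r "$tmp"
  else
    pdftotext "$input" -
  fi
}
```

Dividing by the page count keeps a long scanned document with a few stray words (a stamped header, for instance) from being mistaken for a text-based PDF.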

When to use which

Situation	Strategy
Normal PDF, digital text	pdftotext
Scanned document, forms	OCR (ocrmypdf or tesseract)
Large PDF, only need some pages	Extract pages first with qpdf
PDF with complex tables	pdftotext -layout, or tabula-java

Requirements

pdftotext and pdftoppm (from poppler)
tesseract (for OCR)
ocrmypdf (optional, wraps tesseract)
qpdf (for page extraction)

macOS: brew install poppler tesseract qpdf and optionally pipx install ocrmypdf

Debian/Ubuntu: sudo apt install poppler-utils tesseract-ocr qpdf and optionally pipx install ocrmypdf

Red Flags

PDF is encrypted or password-protected: decrypt first with qpdf --decrypt.
Output is garbled Unicode: the PDF's fonts lack a usable character-to-Unicode mapping. Try OCR instead.
Tables come out jumbled: pdftotext doesn't understand table structure. Use -layout or a dedicated table extractor.
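For the encrypted case, the decrypt step might look like the sketch below. It is an illustration only: the function name decrypt_if_needed is hypothetical, and it assumes a qpdf version recent enough to support --is-encrypted, which exits with status 0 when the file is encrypted.

```shell
# decrypt_if_needed INPUT OUTPUT [PASSWORD]: write an unencrypted copy to OUTPUT.
# Hypothetical helper; assumes qpdf supports --is-encrypted.
decrypt_if_needed() {
  local input=$1 output=$2 password=${3:-}
  if qpdf --is-encrypted "$input"; then
    # An empty password works for files that only restrict permissions.
    qpdf --password="$password" --decrypt "$input" "$output"
  else
    # Any non-zero status is treated as "not encrypted": pass the file through.
    cp "$input" "$output"
  fi
}
```

The output can then go through the normal pdftotext path, for example decrypt_if_needed "$INPUT" /tmp/plain.pdf "$PASSWORD" followed by pdftotext /tmp/plain.pdf -.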
