From text-utils
Use when extracting text from a PDF file, especially when the built-in PDF reader is insufficient or the PDF is too large. Triggers on "read this PDF", "extract text from PDF", "what does this PDF say", scanned documents, image-heavy PDFs.
Slash command: /text-utils:pdf-to-text
Extract text from PDF files. Try the fast path first, fall back to OCR for scanned documents.
pdftotext "$INPUT" -
For layout-sensitive documents (tables, columns):
pdftotext -layout "$INPUT" -
If pdftotext returns empty or garbled output, the PDF probably consists of scanned images. Use ocrmypdf to add a text layer, then extract:
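# Use a temp directory: BSD/macOS mktemp only replaces X's at the end of the template
OCR_DIR=$(mktemp -d /tmp/ocr-XXXXXX)
OCR_OUT="$OCR_DIR/out.pdf"
ocrmypdf --skip-text "$INPUT" "$OCR_OUT"
pdftotext "$OCR_OUT" -
rm -r "$OCR_DIR"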
Or OCR individual pages directly:
PAGE_DIR=$(mktemp -d /tmp/pages-XXXXXX)
pdftoppm "$INPUT" "$PAGE_DIR/page" -png
for f in "$PAGE_DIR"/page-*.png; do tesseract "$f" stdout; done
rm -r "$PAGE_DIR"
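The fast path and the ocrmypdf fallback described above can be combined into one function. This is a minimal sketch, not the skill's own implementation: the name `extract_text` is hypothetical, and "no non-whitespace output" is an assumed stand-in for "empty or garbage".

```shell
# extract_text FILE: print the text of FILE, using OCR only when the fast path finds nothing.
# Hypothetical helper; "empty" here means no non-whitespace characters at all.
extract_text() {
  local input=$1 text tmpdir status=0

  # Fast path: plain pdftotext.
  text=$(pdftotext "$input" - 2>/dev/null)
  if [ -n "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    printf '%s\n' "$text"
    return 0
  fi

  # Fallback: add a text layer with ocrmypdf, then extract from the OCR'd copy.
  tmpdir=$(mktemp -d /tmp/ocr-XXXXXX) || return 1
  ocrmypdf --skip-text "$input" "$tmpdir/out.pdf" \
    && pdftotext "$tmpdir/out.pdf" - \
    || status=$?

  # Clean up the temp directory whether or not OCR succeeded.
  rm -r "$tmpdir"
  return "$status"
}
```

Usage would look like `extract_text "$INPUT" > out.txt`. A PDF that yields garbage rather than nothing still takes the fast path in this sketch, so a stricter check may be needed for such files.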
Extract a page range before processing:
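# Use a temp directory: BSD/macOS mktemp only replaces X's at the end of the template
SUBSET_DIR=$(mktemp -d /tmp/subset-XXXXXX)
SUBSET="$SUBSET_DIR/subset.pdf"
qpdf "$INPUT" --pages . 5-10 -- "$SUBSET"
pdftotext "$SUBSET" -
rm -r "$SUBSET_DIR"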
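For a large PDF, the same qpdf page-range step can be repeated over fixed-size ranges so that each chunk is processed separately. The sketch below assumes a chunk size of 50 pages and reads the page count from pdfinfo (part of poppler); `page_chunks` and `extract_in_chunks` are hypothetical names.

```shell
# page_chunks TOTAL SIZE: print page ranges such as 1-10, 11-20, one per line.
page_chunks() {
  local total=$1 size=$2 start=1 end
  while [ "$start" -le "$total" ]; do
    end=$(( start + size - 1 ))
    if [ "$end" -gt "$total" ]; then end=$total; fi
    echo "$start-$end"
    start=$(( end + 1 ))
  done
}

# extract_in_chunks FILE [SIZE]: extract text chunk by chunk (default size is an assumption).
extract_in_chunks() {
  local input=$1 size=${2:-50} total range tmpdir
  total=$(pdfinfo "$input" 2>/dev/null | awk '/^Pages:/ {print $2}')
  [ -n "$total" ] || return 1
  tmpdir=$(mktemp -d /tmp/subset-XXXXXX) || return 1
  for range in $(page_chunks "$total" "$size"); do
    # Same qpdf invocation as above, with the range filled in per chunk.
    qpdf "$input" --pages . "$range" -- "$tmpdir/part.pdf"
    pdftotext "$tmpdir/part.pdf" -
  done
  rm -r "$tmpdir"
}
```

For example, `page_chunks 25 10` prints `1-10`, `11-20`, and `21-25` on separate lines.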
pdftotext "$INPUT" - | wc -w
If the word count is near zero for a multi-page document, it's scanned.
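The word-count check can be turned into a small test by comparing words to pages. This is a sketch under stated assumptions: the threshold of 10 words per page is an arbitrary illustrative value, the function names are hypothetical, and the page count comes from pdfinfo (part of poppler).

```shell
# Hypothetical threshold: fewer than 10 words per page is treated as "scanned".
MIN_WORDS_PER_PAGE=${MIN_WORDS_PER_PAGE:-10}

# looks_scanned WORDS PAGES: succeed when the average words per page is below the threshold.
looks_scanned() {
  local words=$1 pages=$2
  # With an unknown or zero page count, make no claim.
  [ "$pages" -gt 0 ] || return 1
  [ $(( $words / $pages )) -lt "$MIN_WORDS_PER_PAGE" ]
}

# pdf_looks_scanned FILE: apply the heuristic to a real PDF.
pdf_looks_scanned() {
  local words pages
  # tr strips the padding that some wc implementations add around the number.
  words=$(pdftotext "$1" - 2>/dev/null | wc -w | tr -d '[:space:]')
  pages=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
  looks_scanned "${words:-0}" "${pages:-0}"
}
```

A caller might write `if pdf_looks_scanned "$INPUT"; then ...` and switch to the OCR route inside the branch.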
| Situation | Strategy |
|---|---|
| Normal PDF, digital text | pdftotext |
| Scanned document, forms | OCR (ocrmypdf or tesseract) |
| Large PDF, only need some pages | Extract pages first with qpdf |
| PDF with complex tables | pdftotext -layout, or tabula-java |
- pdftotext (from poppler)
- tesseract (for OCR)
- ocrmypdf (optional, wraps tesseract)
- qpdf (for page extraction)

macOS: brew install poppler tesseract qpdf and optionally pipx install ocrmypdf
Debian/Ubuntu: sudo apt install poppler-utils tesseract-ocr qpdf and optionally pipx install ocrmypdf
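Before running any of the commands, it may help to confirm that the tools are actually on PATH. The following is a rough sketch with hypothetical function names; it treats ocrmypdf as optional, as the dependency list does, and everything else as required.

```shell
# missing_tools NAME...: print each command that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# check_pdf_tools: fail if a required tool is missing; only warn about ocrmypdf.
check_pdf_tools() {
  local missing
  missing=$(missing_tools pdftotext pdftoppm tesseract qpdf)
  if ! command -v ocrmypdf >/dev/null 2>&1; then
    echo "note: ocrmypdf not found (optional)" >&2
  fi
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "missing required tools:" $missing >&2
    return 1
  fi
}
```

Calling `check_pdf_tools || exit 1` at the top of a script stops early with a list of what to install.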
qpdf --decrypt […] -layout or a dedicated table extractor.
npx claudepluginhub jackwillis/claude-plugins --plugin text-utils