From jarrettmeyer
Read and process PDF files. Use this skill whenever the user asks to read, summarize, extract text from, analyze, or convert a PDF file — even if the request seems simple. The Read tool's PDF support is unreliable on this platform (false size errors, empty content), so this skill provides a fallback chain that actually works. Trigger on any mention of PDF files, even if the user just says "read this" and the path ends in .pdf.
How this skill is triggered — by the user, by Claude, or both
Slash command
/jarrettmeyer:pdf
The Read tool frequently fails on PDFs in this environment — false "exceeds 20 MB" errors on small files and empty content returns. Do not stop at the first failure. Work through the fallback chain until you have usable text.
The pdfinfo, pdftotext, and pdftoppm commands come from the poppler package.
brew install poppler
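Before starting, it can help to confirm which of these tools are actually on the PATH. The following is a minimal sketch, not part of the skill itself: `missing_tools` is a hypothetical helper name, and including tesseract in the usage line is an assumption based on the OCR step described later.

```shell
# List which of the given commands are missing from PATH.
# Prints one name per line; returns 0 if nothing is missing, 1 otherwise.
missing_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (hypothetical): missing_tools pdfinfo pdftotext pdftoppm tesseract
```

If the helper prints any names, install the corresponding package before working through the fallback chain.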
Before extracting text, check what you're dealing with:
pdfinfo "<pdf-path>"
This gives you page count, file size, encryption status, and format info. Use it to set expectations for the user (e.g., "This is a 15-page document").
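To pull a single value out of that report, for example the page count or the encryption status, something like the sketch below may be enough. It assumes pdfinfo's usual "Key: value" line layout; `pdfinfo_field` is a hypothetical helper name, not a poppler command.

```shell
# Print the value of one "Key: value" field from pdfinfo output on stdin.
# Only the first matching line is used.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {
      sub(/^[^:]*:[ \t]*/, "")   # drop the key, the colon, and the padding
      print
      exit
    }
  '
}

# Usage (hypothetical):
#   pdfinfo "<pdf-path>" | pdfinfo_field Pages
#   pdfinfo "<pdf-path>" | pdfinfo_field Encrypted
```

Matching on the key plus its colon keeps "Pages" from matching the "Page size" line.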
Try each method in order. Stop at the first one that produces usable text.
The most reliable method for text-based PDFs.
pdftotext "<pdf-path>" -
This prints extracted text to stdout. If the document has tables or columns, try the -layout flag to preserve spatial formatting:
pdftotext -layout "<pdf-path>" -
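Whether the output counts as "usable" is a judgment call. One rough way to automate it is to count the non-whitespace bytes, as sketched below. The helper name `has_usable_text` and the default threshold of 50 are assumptions made for illustration, not values from this skill.

```shell
# Succeed if stdin holds at least MIN_CHARS non-whitespace bytes.
# MIN_CHARS defaults to 50, an arbitrary illustrative threshold.
has_usable_text() {
  local min_chars="${1:-50}" count
  # Strip all whitespace, count what is left, and trim wc's padding.
  count=$(tr -d '[:space:]' | wc -c | tr -d ' ')
  [ "$count" -ge "$min_chars" ]
}

# Usage (hypothetical):
#   pdftotext "<pdf-path>" - | has_usable_text || echo "little or no text; try the next method"
```

A low count usually points to a scanned document, which is the case the OCR step handles.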
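If you need Markdown output with headings, lists, and structure preserved, do not reach for pandoc. Pandoc can write PDF but cannot read it as an input format, so a command such as
pandoc "<pdf-path>" -t markdown
exits with an error instead of converting the file. Take the pdftotext -layout output from the previous step and rebuild the headings, lists, and tables as Markdown from that text.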
Try the built-in Read tool on the PDF path. This sometimes works but frequently fails on this platform. Only use as a fallback if the CLI tools above fail.
If pdftotext returns empty or near-empty output, the PDF likely contains scanned images rather than text. Use tesseract for OCR:
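tmpdir=$(mktemp -d)
# Render every page to a PNG named page-<n>.png inside the temp directory.
pdftoppm -png "<pdf-path>" "$tmpdir/page"
# OCR the pages in order; "-" sends the recognized text to stdout.
for img in "$tmpdir"/page-*.png; do
  tesseract "$img" - 2>/dev/null
done
rm -rf "$tmpdir"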
If all methods fail, tell the user clearly: "I couldn't extract text from this PDF. The file may be encrypted, corrupted, or in an unusual format." Suggest they try opening it in a PDF viewer and copying the text manually.
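The command-line part of the chain can be sketched as a single function. This is an illustration under stated assumptions, not the skill's own implementation: `extract_pdf_text` is a hypothetical name, the Read tool step is left out because it is not a shell command, and "usable" is simplified to "contains any non-whitespace character".

```shell
# Try the text layer first, then fall back to OCR. Prints text on stdout.
# Returns 1 if neither step yields any non-whitespace output.
extract_pdf_text() {
  local pdf="$1" text tmpdir img

  # Step 1: the embedded text layer, if the PDF has one.
  text=$(pdftotext -layout "$pdf" - 2>/dev/null) || text=""
  if [ -n "$(printf '%s' "$text" | tr -d '[:space:]')" ]; then
    printf '%s\n' "$text"
    return 0
  fi

  # Step 2: no usable text layer, so render the pages and OCR them.
  tmpdir=$(mktemp -d) || return 1
  if pdftoppm -png "$pdf" "$tmpdir/page" 2>/dev/null; then
    text=$(for img in "$tmpdir"/page-*.png; do
      if [ -e "$img" ]; then tesseract "$img" - 2>/dev/null || true; fi
    done)
  fi
  rm -rf "$tmpdir"

  [ -n "$(printf '%s' "$text" | tr -d '[:space:]')" ] || return 1
  printf '%s\n' "$text"
}

# Usage (hypothetical):
#   extract_pdf_text "<pdf-path>" || echo "I couldn't extract text from this PDF."
```

A non-zero return is the point at which the failure message above applies.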
Once you have the extracted text:
For a Markdown conversion, start from the pdftotext -layout output and add the headings, lists, and tables yourself; pandoc cannot read PDF input.
Write the .md file next to the PDF with the same base name, unless the user specifies a different path.
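The default output location (same directory, same base name, .md extension) can be derived with plain parameter expansion. A minimal sketch, where `md_path_for` is a hypothetical helper name and only the lowercase and uppercase spellings of the extension are handled:

```shell
# Print the Markdown path that sits next to a PDF with the same base name.
md_path_for() {
  case "$1" in
    *.pdf|*.PDF) printf '%s.md\n' "${1%.*}" ;;   # swap the extension
    *)           printf '%s.md\n' "$1" ;;        # no PDF extension: append
  esac
}

# Usage (hypothetical):
#   out=$(md_path_for "<pdf-path>")
```

Quoting the argument keeps paths with spaces intact.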