From pdfscribe
Convert PDF documents to structured Markdown files. Full pipeline - extract pages (images + text), convert to Markdown, verify against source, auto-fix errors. Use when user wants to digitize PDF documents into Markdown technical docs.
Slash command
/pdfscribe:pdf-to-md
Convert PDF documents into structured, verified Markdown technical documentation.
You are acting as a System Analyst (SA). The Markdown files you produce are development reference documents — the development team will directly use them to implement features, write code, and build systems. This means:
- Each Markdown file carries a page-range note (> Corresponding PDF pages: P{start}-P{end}) so developers can cross-check against the original source when needed.
/pdf-to-md <pdf_path> [--single] [--lang <language>]
- <pdf_path>: Path to the PDF file (required)
- --single: Output a single Markdown file instead of splitting by chapters (optional)
- --lang <language>: Markdown output language, e.g. zh-TW, en (optional; auto-detected by default)
Fixed output path: docs/pdfscribe/{pdf-filename}/
docs/pdfscribe/{pdf-filename}/
├── extracts/
│ ├── images/ # Page screenshots (300 DPI)
│ │ ├── page_01.png
│ │ └── page_02.png
│ ├── texts/ # Extracted text per page
│ │ ├── page_01.txt
│ │ └── page_02.txt
│ └── embedded/ # Embedded images from PDF (charts, diagrams, photos)
│ ├── metadata.json # Image position, size, classification
│ ├── page_03_img_01.png
│ └── page_15_img_01.jpg
├── 00-Table-of-Contents.md # Chapter index (split mode)
├── 01-Chapter-Name.md
└── 02-Chapter-Name.md
File names adapt to the user's language, except 00-Table-of-Contents.md, which is always fixed.
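The layout above can be derived from the PDF path alone. A minimal Python sketch, assuming {pdf-filename} means the file name without its .pdf extension (the source does not state this explicitly); output_paths is a hypothetical helper and not part of pdfscribe:

```python
from pathlib import Path


def output_paths(pdf_path, root="docs/pdfscribe"):
    """Return the locations the skill is described as writing to."""
    # Assumption: {pdf-filename} is the PDF's name without its extension.
    base = Path(root) / Path(pdf_path).stem
    extracts = base / "extracts"
    return {
        "base": base,
        "images": extracts / "images",      # page screenshots
        "texts": extracts / "texts",        # extracted text per page
        "embedded": extracts / "embedded",  # embedded images + metadata.json
        "toc": base / "00-Table-of-Contents.md",  # fixed name in split mode
    }


paths = output_paths("specs/Payment-API.pdf")
print(paths["images"].as_posix())  # docs/pdfscribe/Payment-API/extracts/images
```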
Execute all 4 stages in order. No stage may be skipped.
Run the extraction binary to obtain page screenshots and text.
Locate the binary: Use the Glob tool to search for **/pdfscribe/scripts/pdf-extract-* and obtain the actual path. Choose the binary matching the current platform (windows-amd64, linux-amd64, darwin-arm64, etc.).
<actual_binary_path> <pdf_path>
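Choosing the platform-specific binary can be sketched in Python as follows. This is illustrative only: it assumes the binaries are named pdf-extract-<os>-<arch> as in the examples above and that a Windows build may carry an .exe suffix. Both helper names are hypothetical.

```python
import platform

# Map common machine names to the arch labels used in the examples above.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def platform_suffix(system=None, machine=None):
    """Build a suffix such as 'linux-amd64' for the current platform."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return f"{system}-{_ARCH_ALIASES.get(machine, machine)}"


def pick_binary(candidates, suffix):
    """Return the first candidate path whose file name matches the suffix."""
    wanted = f"pdf-extract-{suffix}"
    for path in candidates:
        name = path.replace("\\", "/").rsplit("/", 1)[-1]
        # Assumption: the Windows binary may end in .exe.
        if name in (wanted, wanted + ".exe"):
            return path
    return None
```

The candidates list would come from the Glob search; if pick_binary returns None, no binary for the current platform was found and the extraction cannot proceed.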
Output goes to docs/pdfscribe/{pdf-name}/extracts/:
- images/: page screenshots (300 DPI)
- texts/: extracted text per page
- embedded/: embedded images + metadata.json (position, size, classification)
No runtime dependencies are needed; the binary is fully self-contained.
After extraction, verify that the file count matches the expected number of pages.
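One way to make the count check concrete is to list which page numbers are missing rather than only comparing totals. A Python sketch, assuming files follow the page_NN naming shown in the directory layout and that the expected page count is already known; missing_pages is a hypothetical helper:

```python
import re

# Matches names such as page_01.png or page_12.txt (any digit width).
_PAGE_RE = re.compile(r"^page_(\d+)\.(png|txt)$")


def missing_pages(file_names, expected_pages, extension):
    """Return the 1-based page numbers with no file of the given extension."""
    found = set()
    for name in file_names:
        match = _PAGE_RE.match(name)
        if match and match.group(2) == extension:
            found.add(int(match.group(1)))
    return [page for page in range(1, expected_pages + 1) if page not in found]


print(missing_pages(["page_01.png", "page_02.png"], 3, "png"))  # [3]
```

Run it once for images/ with "png" and once for texts/ with "txt"; an empty list from both means the counts match.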
First, read the conversion guidelines: use Glob to search for **/pdfscribe/references/conversion-guidelines.md, obtain the actual path, and read it.
Split mode (default):
Read all text files to understand the complete document structure
Identify chapter boundaries (major headings, topic transitions, title pages)
Create a chapter plan: {number}-{chapter-name}.md with page ranges
Report the chapter plan to the user in a table before proceeding:
| No. | File | Pages | Description |
|---|---|---|---|
| 01 | 01-Chapter-Name.md | P1-P5 | Brief description |
| 02 | 02-Chapter-Name.md | P6-P10 | Brief description |
Wait for user confirmation before generating chapter files.
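The page ranges in the chapter plan feed directly into worker dispatch, where each agent handles at most 15 pages. One reading of that limit (an assumption; the source does not say how longer chapters are divided) is to split a long chapter into consecutive chunks. A Python sketch with a hypothetical helper name:

```python
def split_for_workers(start, end, max_pages=15):
    """Split an inclusive page range into chunks of at most max_pages pages."""
    if start > end:
        raise ValueError("start must not exceed end")
    chunks = []
    page = start
    while page <= end:
        last = min(page + max_pages - 1, end)
        chunks.append((page, last))
        page = last + 1
    return chunks


print(split_for_workers(1, 40))  # [(1, 15), (16, 30), (31, 40)]
```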
Dispatch parallel agent workers, each responsible for one chapter (max 15 pages per agent):
- Read the extracts/texts/ files; these are the authoritative source. Page screenshots (extracts/images/) are for layout reference and image placement only, NOT for reading text content.
- Read extracts/embedded/metadata.json for the image inventory. Follow these rules:
  - Skip images with "classification": "decorative"
  - For each "content" image, visually inspect it (Read tool) before embedding; skip template assets, repeated logos, and layout decorations
  - Use y / page_height to place the image at the correct vertical position in the Markdown

Generate a 00-Table-of-Contents.md index file linking all chapters (this file name is fixed and does not change with the user's language). Other file names adapt to the user's language.
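The image rules above can be sketched in Python. Only classification, y, and page_height are named in the text; the overall shape of metadata.json (a JSON array of objects) and the file and page keys are assumptions made for illustration, as is the helper name:

```python
import json


def placeable_images(metadata_json):
    """Return (page, relative_y, file) for images classified as content."""
    # Assumption: metadata.json is a JSON array of per-image objects.
    entries = json.loads(metadata_json)
    placed = []
    for entry in entries:
        if entry.get("classification") != "content":
            continue  # decorative images are not embedded
        # y / page_height gives the relative vertical position on the page.
        relative_y = entry["y"] / entry["page_height"]
        placed.append((entry["page"], round(relative_y, 3), entry["file"]))
    return sorted(placed)
```

Sorting by page and then by relative position yields the order in which images would appear in the Markdown. Each "content" image still needs a visual inspection before it is embedded; this filter does not replace that step.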
Single-file mode (--single):
These documents will be used directly for software development. Verify from a developer's perspective: would a developer reading this Markdown produce the correct implementation?
Dispatch parallel agent workers to verify the accuracy of Markdown content against the original PDF data:
- Page screenshots (extracts/images/)
- Extracted text (extracts/texts/)
- Cross-check metadata.json to ensure all "content" images are referenced at the correct positions

CRITICAL: This is a LOOP, not a single pass. You MUST repeat fix → verify → fix → verify until zero issues remain or the round limit is reached. Do NOT stop after one round of fixes.
Set round = 1. Then repeat:
- If round > 5 → STOP. Report remaining issues to the user.
- round = round + 1

Common mistake to avoid: Do NOT treat step 5 as optional. Every round of fixes MUST be followed by a full re-verification. The loop only ends when verification reports show zero issues or round > 5.
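The round-limited loop can be sketched in Python as follows. verify and fix are hypothetical callables standing in for the verification and fix workers; only the limit of 5 rounds and the rule that every fix is followed by a full re-verification come from the text.

```python
MAX_ROUNDS = 5


def fix_until_clean(verify, fix, max_rounds=MAX_ROUNDS):
    """Alternate fix and verify; return (fix_rounds_used, remaining_issues)."""
    issues = verify()
    round_no = 1
    while issues and round_no <= max_rounds:
        fix(issues)
        issues = verify()  # a fix round is never left unverified
        round_no += 1
    return round_no - 1, issues
```

If the returned issue list is not empty, the round limit was reached and the remaining issues should be reported to the user.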
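- Content that is unclear in the source: use a localized marker ([Unclear in Original] for English users)
- Errors reproduced from the source: use a localized marker ([Sic] for English users)

When all stages are finished, report the output directory path to the user: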
Conversion complete. Output:
docs/pdfscribe/{pdf-filename}/
npx claudepluginhub gonetone/pdfscribe --plugin pdfscribe