This skill should be used when the user asks to "convert a PDF book to Markdown", "extract a book", "process this PDF", or has PDF books to organize or make searchable. Also triggers when the user mentions a book PDF or drops a PDF path that appears to be a book (chapters, table of contents, forewords). Specifically for books, not papers or short documents. If it has an Abstract and References section, use extract-study instead.
Slash command: /obscura-scraper-crawler:extract-book
Convert PDF books into well-structured Markdown files with proper chapter headings, metadata blocks, horizontal rule dividers, and cleaned text. The bundled Python script handles chapter detection automatically using multiple strategies.
The script requires Python 3.10+ and the pdfplumber library. Set up a virtual environment if one doesn't already exist:
python3 -m venv .venv && source .venv/bin/activate && pip install pdfplumber
If .venv already exists, just activate and ensure pdfplumber is installed:
source .venv/bin/activate && pip install pdfplumber 2>/dev/null
Add .venv/ to .gitignore if it's not already there.
Always start with --dry-run to preview what the script detects before writing output:
source .venv/bin/activate && python {SKILL_DIR}/scripts/extract_book_pdf.py "<path-to-pdf>" --dry-run
This prints:
Show the dry-run output and check:
Run with --render-images to capture image-based pages (chapter titles, diagrams, charts, tables rendered as images):
source .venv/bin/activate && python {SKILL_DIR}/scripts/extract_book_pdf.py "<path-to-pdf>" --render-images -o "<output-path>.md"
This does two things:
Saves the rendered page images under .tmp/extracted_images/<book>/
Inserts <!-- IMAGE: path | Page N — ... --> placeholders in the Markdown where those pages appear
If -o is not specified, the output goes next to the PDF with the same name and .md extension.
If the user doesn't need image processing, omit --render-images for a text-only extraction (faster, no image files generated).
After extraction, search the output for <!-- IMAGE: placeholders. For each one:
If the image only repeats the ## heading above, delete the placeholder. If it provides a better title, update the heading.
Otherwise, replace the placeholder with **Figure N:** [description]
Work through placeholders in order, batch-reading images when they're adjacent. For books with many image pages (50+), ask the user if they want to process all of them or just the important ones (chapter titles, tables, charts).
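The placeholder search can be sketched in Python. This is a minimal sketch, not part of the bundled script: it assumes placeholders follow the `<!-- IMAGE: path | Page N — ... -->` shape shown above, and it treats "adjacent" as consecutive page numbers, which is an interpretation. The names `find_placeholders` and `batch_adjacent` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed placeholder shape, taken from the text above:
#   <!-- IMAGE: path | Page N — ... -->
PLACEHOLDER_RE = re.compile(
    r"<!--\s*IMAGE:\s*(?P<path>[^|]+?)\s*\|\s*Page\s+(?P<page>\d+).*?-->"
)


class Placeholder(NamedTuple):
    line_no: int  # 1-based line number in the Markdown file
    path: str     # image path recorded in the placeholder
    page: int     # PDF page number recorded in the placeholder


def find_placeholders(markdown: str) -> list[Placeholder]:
    """Return every image placeholder in document order."""
    found = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        for match in PLACEHOLDER_RE.finditer(line):
            found.append(
                Placeholder(line_no, match.group("path"), int(match.group("page")))
            )
    return found


def batch_adjacent(placeholders: list[Placeholder]) -> list[list[Placeholder]]:
    """Group placeholders whose page numbers are consecutive (assumed meaning of 'adjacent')."""
    batches: list[list[Placeholder]] = []
    for ph in placeholders:
        if batches and ph.page == batches[-1][-1].page + 1:
            batches[-1].append(ph)
        else:
            batches.append([ph])
    return batches
```

Once the vision pass is finished, `find_placeholders` should return an empty list for the output file.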
After the vision pass, read the first 20-30 lines of the output and fix:
Title: If auto-detected title is wrong (common with books that have disclaimer pages first), edit the # Title line to the actual book title with subtitle.
Source metadata block: Every book extraction must have a complete metadata block immediately below the title. The script auto-detects what it can, but many PDFs lack structured copyright pages. Fill in any gaps manually — check the PDF's first few pages, or look up the book online if needed. The required fields are:
If the script missed any of these, add them by hand. A book extraction without at least author, copyright year, and ISBN is incomplete.
Spot-check a chapter transition: Read around a ## Chapter heading to verify content flows correctly and there's no bleed from the previous chapter.
Verify no placeholders remain: Search for <!-- IMAGE: to confirm all were processed.
Location: Ask the user where to save it unless the project's CLAUDE.md or an existing folder of book extractions makes it obvious. Match the filing pattern of neighboring books if there is one.
Filename: Lowercase kebab-case — <author>-<short-title>.md.
<author> is the author's name, normalized (drop middle initials). "Sally K Norton" → sally-norton.
<short-title> is the book's main title, 3–6 words. Drop the subtitle, but keep the full title + subtitle in the # H1 heading inside the file.
Separate words with -, and drop apostrophes/accents. "Mendoza-López" → mendoza-lopez.
Example: sally-norton-toxic-superfoods.md.
PDF handling: Rename the source PDF to match the Markdown filename exactly (same kebab slug, same directory, .pdf extension) so the pair is discoverable under a single search. Use git mv if the PDF is already tracked.
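The naming rules above can be sketched as a small helper. This is an illustration only; `kebab`, `drop_middle_initials`, and `book_filename` are hypothetical names, and the sketch assumes the author string is a plain name with no credentials attached and that the short title has already been chosen by hand.

```python
import re
import unicodedata


def kebab(text: str) -> str:
    """Lowercase kebab-case with accents and apostrophes dropped."""
    # Decompose accented characters, then drop the combining marks ("ó" -> "o").
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove apostrophes outright so "O'Brien" becomes "obrien", not "o-brien".
    decomposed = decomposed.replace("'", "").replace("\u2019", "")
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Any remaining run of non-alphanumerics becomes a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def drop_middle_initials(author: str) -> str:
    """Remove single-letter middle tokens such as "K" or "K."."""
    parts = author.split()
    if len(parts) <= 2:
        return author
    middle = [p for p in parts[1:-1] if len(p.rstrip(".")) > 1]
    return " ".join([parts[0], *middle, parts[-1]])


def book_filename(author: str, short_title: str, ext: str = ".md") -> str:
    """Build <author>-<short-title> with the given extension."""
    return f"{kebab(drop_middle_initials(author))}-{kebab(short_title)}{ext}"
```

Calling `book_filename` a second time with `ext=".pdf"` gives the matching name for the source PDF, so the pair shares one slug.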
If the book references or is referenced by other material already in the project:
Add and commit with a descriptive message.
# Book Title: Subtitle
**Author:** Author Name, Credentials
**Publisher:** Publisher Name
**Copyright:** Copyright © Year
**ISBN:** 978-x-xxxxxx-xx-x
**Edition:** [if applicable]
---
## Foreword by [Name]
[extracted text]
---
## Chapter 1: Chapter Title Here
[extracted text]
---
## Chapter 2: Next Chapter Title
[extracted text]
---
The metadata block uses labeled fields (like extract-study and extract-transcript) so every book in the repo has a consistent, scannable header. The script outputs these fields automatically when it can detect them from the PDF. Fields the script misses must be filled in during post-processing.
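A quick check for gaps in the metadata block might look like the following. It is a sketch under assumptions: the block is taken to be the `**Label:** value` lines before the first `---` divider, as in the template above, and the minimum set (author, copyright, ISBN) comes from the post-processing notes. `missing_metadata` is a hypothetical helper, not something the script provides.

```python
import re

# Minimum fields named in the post-processing notes.
REQUIRED_FIELDS = ("Author", "Copyright", "ISBN")

# Matches lines such as "**Author:** Author Name, Credentials".
FIELD_RE = re.compile(r"\*\*(?P<label>[^:*]+):\*\*\s*(?P<value>.*)")


def missing_metadata(markdown: str) -> list[str]:
    """Return required fields that are absent or empty in the header block."""
    present = set()
    for line in markdown.splitlines():
        if line.strip() == "---":
            break  # the first divider ends the metadata block
        match = FIELD_RE.match(line.strip())
        if match and match.group("value").strip():
            present.add(match.group("label").strip())
    return [field for field in REQUIRED_FIELDS if field not in present]
```

An empty result means the minimum fields are filled in; anything returned still has to be looked up and added by hand.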
The script tries four strategies in order, using the first one that finds 3+ chapters:
Text markers — Scans for CHAPTER N or Chapter N: patterns at the start of pages. Most common in traditionally formatted books.
Single-number chapters — Some publishers (e.g., Simon & Schuster) mark chapters with just a bare number (1, 2, 3) as the first line. The script validates these are sequential.
Section headers — For workbook-style books without traditional chapters. Detects ALL-CAPS headers. If there are too many (>25), keeps only multi-page sections.
TOC-based sections — For guidebooks. Parses the Table of Contents, then fuzzy-matches each entry to actual page content.
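The "first strategy that finds 3+ chapters wins" logic can be sketched as a fallback chain. This is not the bundled script's code. It is a simplified illustration with only the first two strategies filled in, the function names are hypothetical, and the strict 1, 2, 3, ... check is one possible reading of "validates these are sequential".

```python
import re
from typing import Callable

Pages = list[str]                 # one string of extracted text per PDF page
Chapters = list[tuple[int, str]]  # (page index, heading text)

MIN_CHAPTERS = 3  # the "3+ chapters" threshold from the text

# "CHAPTER N" or "Chapter N:" at the start of a page.
CHAPTER_RE = re.compile(r"^\s*chapter\s+\d+\b", re.IGNORECASE)


def first_line(page: str) -> str:
    """First non-blank line of a page, stripped."""
    for line in page.splitlines():
        if line.strip():
            return line.strip()
    return ""


def detect_text_markers(pages: Pages) -> Chapters:
    """Pages whose first line looks like a chapter marker."""
    return [(i, first_line(p)) for i, p in enumerate(pages)
            if CHAPTER_RE.match(first_line(p))]


def detect_bare_numbers(pages: Pages) -> Chapters:
    """Pages starting with a bare number, accepted only if they run 1, 2, 3, ..."""
    hits = [(i, first_line(p)) for i, p in enumerate(pages)
            if first_line(p).isdecimal()]
    numbers = [int(text) for _, text in hits]
    if numbers != list(range(1, len(numbers) + 1)):
        return []
    return hits


STRATEGIES: list[tuple[str, Callable[[Pages], Chapters]]] = [
    ("text markers", detect_text_markers),
    ("single-number chapters", detect_bare_numbers),
    # Section-header and TOC-based detectors would follow here.
]


def detect_chapters(pages: Pages) -> tuple[str, Chapters]:
    """Try each strategy in order; use the first that finds enough chapters."""
    for name, strategy in STRATEGIES:
        chapters = strategy(pages)
        if len(chapters) >= MIN_CHAPTERS:
            return name, chapters
    return "none", []
```

Keeping the strategies in an ordered list makes the precedence explicit: a later, looser detector only runs when every stricter one before it has come up short.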
Additionally, the script always detects:
Part 1, Part II, etc.
OceanofPDF.com watermark lines
## headings)
To process multiple PDFs in a directory:
source .venv/bin/activate
for pdf in /path/to/pdfs/*.pdf; do
python {SKILL_DIR}/scripts/extract_book_pdf.py "$pdf" --dry-run
done
Review the dry runs, then run without --dry-run for the ones that look good. Books with tricky formats may need the output path specified with -o.
Check the --dry-run output and consider running the script to get a flat extraction, then manually add ## Chapter headings.
Edit the # Title line manually after extraction.
Use --dry-run to verify before extracting.