Use when preparing a single document (PDF, DOCX, TXT) for RAG pipelines, vector store ingestion, OpenAI file search, Pinecone/Weaviate/Chroma, or any LLM retrieval context. Triggers include "optimize for RAG", "prepare for vector store", "clean up docs for AI", "extract and reformat", "process this SOP/training doc". Produces chunk-independent Markdown. Single file in, single optimized Markdown out — for multi-file batch consolidation, use doc-consolidator instead.
Slash command: `/itero:doc-optimizer`
`<skill-dir>` below means the folder containing this SKILL.md (announced when the
skill loads). Under a Claude Code plugin install this is the skills/doc-optimizer
subfolder of the plugin root; under a manual install it is the skill folder
inside your agent's skills directory. All scripts run via uv run —
dependencies resolve automatically (PEP 723).
Turns one PDF, DOCX, or TXT into a clean, chunk-independent Markdown file ready for vector-store ingestion. The goal: when a chunker splits the output automatically, each chunk should be self-contained and immediately useful to an LLM without needing surrounding context.
This is not summarization. Every piece of informational content survives. Structure, redundancy, and noise change — substance does not.
The bundled script at <skill-dir>/scripts/extract.py runs under uv. uv reads the script's inline dependency declaration and creates an isolated venv on first run — no separate pip install step. If uv is missing, the user installs it once per machine (see the repo's INSTALL.md). After that, every uv run … invocation is self-contained.
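For readers unfamiliar with PEP 723, the inline dependency declaration is a specially formatted comment block at the top of the script. The sketch below shows what such a header could look like and how a tool can locate it. The dependency list and Python version are assumptions based on the libraries this skill names for extraction, not a copy of the real `extract.py` header, and `inline_metadata_toml` is a hypothetical helper.

```python
import re
from typing import Optional

# Hypothetical script text. The dependency list and Python version are
# assumptions, not the actual header of extract.py.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "pymupdf",
#   "pdfplumber",
#   "python-docx",
# ]
# ///
print("extraction logic would go here")
'''

# Pattern taken from the PEP 723 reference implementation.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def inline_metadata_toml(script: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None."""
    for match in BLOCK_RE.finditer(script):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or a bare "#") from every line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


print(inline_metadata_toml(EXAMPLE_SCRIPT))
```

The returned text is plain TOML, so on Python 3.11 or later it could be passed to `tomllib.loads` to get the dependency list as a Python object.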
Preserve all meaning. Every informational piece from the source survives the transformation. If you removed content, an LLM retrieving any single chunk must still have what it needs.
Chunk independence is the prime directive. A chunk that says "as described above" or "see Section 3" is useless to retrieval. Every section needs enough context to stand alone.
| Bad | Good |
|---|---|
| "Follow the procedure in Section 2.1" | "Follow the equipment calibration procedure (calibrate the sensor using baseline readings, then verify against the reference standard)" |
| "See the table above for thresholds" | "See the temperature thresholds: 40–60°F nominal, 60–80°F warning, >80°F alarm" |
Remove noise, not signal. Strip extraction artifacts and repeated boilerplate. When in doubt, keep it.
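One way to check chunk independence mechanically is to scan the optimized Markdown for positional references such as "as described above" or "see Section 3". The following is a minimal sketch, not part of the bundled scripts: the pattern list is an assumption and will need extending for the phrasing a given source uses, and `find_dangling_references` is a hypothetical name.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; extend them for the phrasing your sources use.
CROSS_REFERENCE_PATTERNS = [
    r"\bas (?:described|shown|noted|mentioned|discussed) (?:above|below|earlier)\b",
    r"\bsee (?:the )?(?:section|table|figure|appendix|chapter) \d[\w.]*",
    r"\b(?:see|in) the (?:section|table|figure|procedure) (?:above|below)\b",
    r"\bthe (?:procedure|steps) in section \d[\w.]*",
]
_CROSS_REFERENCE_RE = re.compile("|".join(CROSS_REFERENCE_PATTERNS), re.IGNORECASE)


def find_dangling_references(markdown: str) -> List[Tuple[int, str]]:
    """Return (line_number, matched_phrase) for each positional reference."""
    hits = []
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        for match in _CROSS_REFERENCE_RE.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "Follow the procedure in Section 2.1\nSee the table above for thresholds"
    for line_number, phrase in find_dangling_references(sample):
        print(f"line {line_number}: {phrase}")
```

A hit is a prompt to inline the referenced content, as in the Good column of the table, rather than something to delete automatically.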
uv run "<skill-dir>/scripts/extract.py" <input-file> --output /tmp/raw.txt
Handles PDF (pymupdf + pdfplumber for tables), DOCX (python-docx), TXT. Tables are emitted as GitHub-flavored Markdown.
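To illustrate what emitting a table as GitHub-flavored Markdown involves, here is a small sketch that renders rows of cells into a GFM table. It is not the code in `extract.py`; `rows_to_gfm` is a hypothetical helper, and the input shape (a list of rows, with `None` for empty cells) is an assumption modeled on how table extractors commonly return data.

```python
from typing import Optional, Sequence


def rows_to_gfm(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row is the header) as a GitHub-flavored Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Collapse newlines and runs of spaces, then escape pipes so they
        # are not read as column separators.
        return " ".join(text.split()).replace("|", "\\|")

    def render(row: Sequence[Optional[str]]) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(cell) for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [render(header), "|" + "---|" * width]
    lines.extend(render(row) for row in body)
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_gfm([["State", "Range"], ["Nominal", "40-60 F"], ["Alarm", None]]))
```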
Exit codes:
- 0: success
- 1: extraction failure (unsupported format, scanned PDF with no text layer, etc.), with a message on stderr

If exit 1 on a PDF, the doc is likely scanned. Tell the user to OCR first (e.g., `ocrmypdf in.pdf out.pdf`) and retry.
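When the extraction step is driven from a script instead of typed by hand, the exit-code handling above could be wrapped roughly as follows. This is a sketch under assumptions: `advice_for_exit` and `run_extract` are hypothetical helpers, the command line mirrors the invocation shown earlier, and the wording of the advice is illustrative.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def advice_for_exit(returncode: int, input_file: str) -> Optional[str]:
    """Map an extract.py exit code to guidance for the user (None on success)."""
    if returncode == 0:
        return None
    if Path(input_file).suffix.lower() == ".pdf":
        return (
            "Extraction failed. The PDF is likely scanned; run OCR first "
            "(e.g. ocrmypdf in.pdf out.pdf) and retry."
        )
    return "Extraction failed. Check the message on stderr."


def run_extract(script: str, input_file: str, output: str = "/tmp/raw.txt") -> Optional[str]:
    """Run the extraction script under uv and return guidance, if any."""
    result = subprocess.run(
        ["uv", "run", script, input_file, "--output", output],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Pass the script's own error message through unchanged.
        sys.stderr.write(result.stderr)
    return advice_for_exit(result.returncode, input_file)
```

Keeping the exit-code mapping in its own function makes it possible to check the guidance without having `uv` installed.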
Read the extracted text and pick a strategy:
Strip:
Fix:
- Words hyphenated across line breaks (`recon-\nstruct` → `reconstruct`)

Test: if you removed something, does an LLM retrieving any single chunk still have what it needs? If a safety warning is critical to a specific procedure, keep a brief contextual reference even after deduplication.
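The line-break hyphenation repair (`recon-\nstruct` → `reconstruct`) can be sketched with a single regular expression. This is an illustrative approach, not the bundled implementation, and `rejoin_hyphenated_words` is a hypothetical name. It assumes that a hyphen directly followed by a newline and a lowercase letter marks a broken word, so a genuine compound that happens to wrap at its hyphen would be joined incorrectly and needs a manual check.

```python
import re

# A letter, a hyphen, a newline, then a lowercase letter: treated as one
# word that was split at the end of a line.
_BROKEN_WORD_RE = re.compile(r"([A-Za-z])-\n([a-z])")


def rejoin_hyphenated_words(text: str) -> str:
    """Join words that were hyphenated across a line break."""
    return _BROKEN_WORD_RE.sub(r"\1\2", text)


if __name__ == "__main__":
    print(rejoin_hyphenated_words("recon-\nstruct the table"))
```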
Add context headers. Every major section opens with enough context that a cold reader understands what they're looking at: what the section covers, what system/process/domain it relates to, any critical prerequisites.
Example:
## Sensor Calibration Procedure — Model X200 Temperature Monitoring System
This procedure covers the monthly calibration of temperature sensors in the X200 monitoring system. Calibration must be performed by certified technicians with access to the reference standard kit (Part #RS-4401).
Normalize hierarchy. H1 = document title. H2 = major sections. H3 = subsections. Don't go deeper than H4 — flatten if needed.
Make procedures explicit. Implicit steps buried in prose ("Then you would…") get extracted into numbered steps.
Preserve tables. Keep tabular data as Markdown tables — most chunkers handle them well.
Group related content. If the same topic is scattered across the source, consolidate it.
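Of the restructuring rules above, the heading cap is the easiest to check mechanically. The sketch below flattens any heading deeper than H4 to H4 and leaves fenced code blocks untouched. `cap_heading_depth` is a hypothetical helper, and its fence detection is deliberately simple (it toggles on any line that opens with three or more backticks or tildes).

```python
import re

_HEADING_RE = re.compile(r"^(#{1,6})(\s+\S.*)$")
_FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})")


def cap_heading_depth(markdown: str, max_depth: int = 4) -> str:
    """Flatten headings deeper than max_depth, skipping fenced code blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if _FENCE_RE.match(line):
            in_fence = not in_fence
        elif not in_fence:
            match = _HEADING_RE.match(line)
            if match and len(match.group(1)) > max_depth:
                line = "#" * max_depth + match.group(2)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(cap_heading_depth("# Title\n##### Too deep\nBody text"))
```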
Write to ./optimized/<descriptive-slug>.md (create the optimized/ subfolder alongside the input if missing).
The filename matters for retrieval. It ends up as vector-store metadata. When a retrieval agent surfaces a chunk, the filename alone needs to disambiguate it from sibling docs in the index. A name like Acme_Day1_Training-optimized.md adds nothing — it just echoes the input. A name like acme-new-hire-training-day-1-role-kpis-hipaa-tools.md tells the agent what the chunk is about before it reads a word of content.
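A descriptive slug and its output path can be assembled from a few descriptive parts. The sketch below is one possible approach, with hypothetical helper names (`build_slug`, `output_path`). Dropping process words such as `optimized` or `final` reflects the slug guidance in this document, but the exact word list here is an assumption and would wrongly strip those words from a title that legitimately contains them.

```python
import re
from pathlib import Path

# Assumed list of process words that add nothing to retrieval.
PROCESS_WORDS = {"optimized", "processed", "final", "cleaned"}


def build_slug(*parts: str) -> str:
    """Join descriptive parts into a lowercase, hyphen-separated slug."""
    words = []
    for part in parts:
        words.extend(re.findall(r"[a-z0-9]+", part.lower()))
    return "-".join(word for word in words if word not in PROCESS_WORDS)


def output_path(input_file: str, *parts: str) -> Path:
    """Place the Markdown file in an optimized/ folder alongside the input."""
    return Path(input_file).parent / "optimized" / (build_slug(*parts) + ".md")


if __name__ == "__main__":
    print(build_slug("Acme", "New-Hire Training", "Day 1: Role, KPIs, HIPAA, Tools"))
```

The caller still has to create the `optimized/` folder if it is missing, for example with `path.parent.mkdir(exist_ok=True)`.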
Slug rules:
- Client or organization prefix (e.g., `acme-`, `globex-`).
- Document type (e.g., `training-deck`, `call-transcript`, `coaching-reference`, `agent-guide`, `sop`, `script`, `playbook`).
- Topic (e.g., `day-1-onboarding`, `awv-decline-recovery`, `objection-handling`, `mock-call-exercises`).
- Example: `call-transcript-villa-adams-elevance-declined`.
- No `optimized`, `processed`, `final`, `cleaned`, or other process verbs; the `optimized/` folder name conveys that, and they waste token budget in the filename.

Output template:
# [Document Title — natural-language version of the slug]
> **Source**: [original filename] | **Type**: [SOP/Training/Transcript/Reference] | **Processed**: [date]
[2–3 sentence summary of what the document covers and who it's for — helps the LLM anchor any chunk retrieved from it.]
## [Section Title — descriptive and self-contained]
[Context anchor — what this section is about and how it fits]
[Content...]
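If the template is filled programmatically, a small renderer keeps the header line and section layout consistent across documents. The sketch below is an assumption-laden illustration: `render_document` is a hypothetical helper, the ISO date format is a choice the template does not specify, and each section is passed as a (title, context anchor, content) tuple.

```python
from datetime import date
from typing import List, Optional, Tuple


def render_document(
    title: str,
    source: str,
    doc_type: str,
    summary: str,
    sections: List[Tuple[str, str, str]],
    processed: Optional[date] = None,
) -> str:
    """Fill the output template; each section is (title, anchor, content)."""
    processed = processed or date.today()
    parts = [
        f"# {title}",
        f"> **Source**: {source} | **Type**: {doc_type} "
        f"| **Processed**: {processed.isoformat()}",
        summary,
    ]
    for section_title, anchor, content in sections:
        parts.extend([f"## {section_title}", anchor, content])
    # Blank lines between blocks keep the Markdown unambiguous for chunkers.
    return "\n\n".join(parts) + "\n"
```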
Before delivering:
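- `extract.py` detects an empty text layer and exits 1. Flag to the user; suggest OCR (`ocrmypdf`). Do not fabricate content.
- Where the source has a diagram or image, leave a placeholder such as `[Diagram: Equipment Layout — see original document]` so the LLM knows something visual exists there.
- For multi-file batch consolidation, use `doc-consolidator`.

| Mistake | Fix |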
|---|---|
| Summarizing instead of restructuring | Every informational sentence survives in some form |
| Leaving cross-references intact | Inline the referenced context or add a brief anchor |
| Over-nesting headings (H5, H6) | Cap at H4; flatten deeper nesting |
| Converting tables to prose | Keep tables as Markdown tables |
| Stripping all safety boilerplate | Keep one canonical version; reference it where critical |
npx claudepluginhub itero-ai/skills --plugin itero