Ingest business documents (.pptx, .pdf, .xlsx, .docx) into FAITHFUL markdown for a RAG knowledge base or review — VLM-first (Vision-Language Model), NOT OCR. Use this whenever the user wants to convert, ingest, extract, or "turn into markdown/text" any slide deck, report, spreadsheet dashboard, or PDF — ESPECIALLY chart-, dashboard-, or image-heavy and Vietnamese/mixed-language documents where chart numbers, tables, and verbatim text must survive. Trigger on phrases like "convert this deck/report/pptx/pdf to markdown", "ingest these documents", "extract the data/tables/charts from this presentation", "build a knowledge base from these files", or any document→markdown task where faithfulness matters. Prefer this over a plain text-extraction/OCR approach, which silently drops charts and merged-cell layouts. Validated 10/10 on real documents by an independent Opus-4.8 faithfulness judge (see references/validation.md).
Slash command: /document-processor:document-processor
Files: references/faithfulness_judge.md, references/method_and_gotchas.md, references/validation.md, scripts/common.py, scripts/convert_pdf.py, scripts/convert_pptx.py, scripts/convert_pptx_v2.py, scripts/docproc_env.py, scripts/ingest.py, scripts/judge_summary.py, scripts/prepare_judge.py, scripts/prompts.py, scripts/render_pages.py
Convert business documents to faithful markdown with a structure-first + selective
Vision-Language-Model (VLM) method. The guiding principle: OCR/plain text extraction looks
like it works but silently drops charts, racing/stacked bars, merged-cell "dashboard" xlsx, and
mixed-language layout. This skill extracts semantic structure — charts → JSON/markdown, tables
→ HTML, verbatim text, [UNCLEAR: …] on doubt — and only sends genuinely-visual pages to the VLM,
keeping text-dominant pages on a cheap, exact native path.
It also ships a built-in faithfulness self-check: an independent Opus-4.8 sub-agent reads the original page image and the produced markdown and grades fidelity (a different model from the ingester = an honest cross-check). This is the same procedure that validated the method 10/10.
Any "document → markdown/text" task where the content matters: slide decks, research/market reports, brand-health studies, UT (user-test) reports, campaign trackers, spreadsheet dashboards, PDFs. Especially chart-/image-heavy or Vietnamese material. If the user just needs a couple of lines of plain text from a simple text PDF, the native path here still handles it cheaply — but reach for this skill the moment fidelity of charts/tables/numbers is in play.
- uv (3.12). Python deps: langchain-openai, langchain-core, python-pptx, python-docx, pypdfium2, pillow, python-dotenv, httpx, openpyxl.
- soffice (LibreOffice): converts pptx/xlsx/docx → pdf for rendering. PDFs need NO soffice (pypdfium2 renders them directly).
- poppler (pdftotext, pdfinfo): native PDF text + page counts.
- OPENROUTER_API_KEY in the environment or ~/dev/.env (read automatically). The VLM is Qwen3-VL via OpenRouter. The Opus-4.8 verifier runs on the Claude subscription (no API key).
Install deps once, e.g.: uv venv && uv pip install langchain-openai langchain-core python-pptx python-docx pypdfium2 pillow python-dotenv httpx openpyxl. Run scripts with that interpreter.
# one file or a whole directory -> faithful markdown in <out>/<stem>.md
python scripts/ingest.py --in /path/to/file_or_dir --out /path/to/out
Routing is automatic by extension:
- .pptx (convert_pptx_v2.py): text/table slides render natively (no LLM, verbatim); chart/picture/visual slides → Qwen3-VL.
- .pdf (convert_pdf.py): pdftotext per page; low-text pages → Qwen3-VL.
Output carries per-unit markers <!-- slide N | route: native|vlm --> / <!-- page N | route: … --> (parse prefix-only; the suffix varies). Per-file cost/routing is logged to <out>/_ingest_log.jsonl.
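The "parse prefix-only" advice can be sketched as a small regex that matches the stable start of a marker and ignores whatever follows the route word. This is a minimal illustration, not code from the skill: the names `UnitMarker`, `MARKER_PREFIX`, and `find_markers` are hypothetical, and the exact whitespace inside real markers is assumed to match the examples above.

```python
import re
from typing import NamedTuple


class UnitMarker(NamedTuple):
    kind: str    # "slide" or "page"
    number: int  # unit number as written in the marker
    route: str   # e.g. "native" or "vlm"


# Match only the stable prefix of a marker. Anything after the route word
# (extra fields, the closing "-->") is deliberately left unmatched, because
# the suffix is documented as variable.
MARKER_PREFIX = re.compile(
    r"<!--\s*(slide|page)\s+(\d+)\s*\|\s*route:\s*([A-Za-z_]+)"
)


def find_markers(markdown: str) -> list[UnitMarker]:
    """Return every per-unit marker found in the markdown, in document order."""
    return [
        UnitMarker(m.group(1), int(m.group(2)), m.group(3))
        for m in MARKER_PREFIX.finditer(markdown)
    ]
```

Because the pattern never anchors on the closing `-->`, a marker that carries additional trailing fields still parses to the same `(kind, number, route)` triple.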
The default model is Qwen3-VL-32B (DOC_VLM_MODEL=medium) — fast, cheap, and faithful on charts,
tables, numbers, and text. But it under-enumerates dense same-shape cluster visuals (honeycomb
diagrams, packed icon grids): in validation it dropped 2 of 14 hexagon labels at both 2× and 3×
DPI — a model-capacity limit, not a resolution one. The 235B flagship captured all 14.
Rule: when a document is heavy in dense cluster/diagram visuals, or the built-in verifier flags a miscount / dropped diagram element, re-run with
DOC_VLM_MODEL=flagship. Do NOT just raise the DPI — that does not fix enumeration. Escalating the model does.
DOC_VLM_MODEL=flagship python scripts/ingest.py --in <file> --out <out> # for the hard ones
DOC_VLM_SCALE=3.0 python scripts/ingest.py --in <file> --out <out> # optional: denser charts
The ingester is a VLM; verify its output with an independent Opus-4.8 judge that reads the original page image (ground truth) vs the produced markdown. Numbers are zero-tolerance.
python scripts/render_pages.py <src-file> <verify_dir>/<stem>
python scripts/prepare_judge.py --md <out>/<stem>.md \
--render <verify_dir>/<stem>/pngs --out <verify_dir>/<stem>
Spawn a sub-agent (model: opus) and give it the manifest at <verify_dir>/<stem>/manifest.json. Use the exact contract in references/faithfulness_judge.md (image = ground truth; chart/table/research numbers zero-tolerance; fabrication = fail; imprecise transcription of tiny printed fine print = minor; zoom to verify before calling it fabrication). It writes <verify_dir>/<stem>.verdict.json.
python scripts/judge_summary.py --dir <verify_dir>
A file passes iff pass-rate ≥ 0.90 and no number-error on research data and no hard fail.
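The pass rule is a conjunction of three conditions, which a few lines of Python make explicit. This is only a sketch of the stated rule: the `FileVerdict` fields are hypothetical and are not the actual schema of the verdict JSON or of judge_summary.py, and treating a file with zero judged units as failing is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class FileVerdict:
    # Hypothetical summary fields; the real verdict.json schema may differ.
    units_passed: int
    units_total: int
    number_errors_on_research_data: int
    hard_fails: int


def file_passes(v: FileVerdict, min_pass_rate: float = 0.90) -> bool:
    """Apply the ship-bar: pass-rate threshold AND no number errors AND no hard fails."""
    if v.units_total == 0:
        return False  # assumption: nothing judged is treated as not passing
    pass_rate = v.units_passed / v.units_total
    return (
        pass_rate >= min_pass_rate
        and v.number_errors_on_research_data == 0
        and v.hard_fails == 0
    )
```

Note that a perfect pass-rate does not rescue a file: a single number error on research data or a single hard fail is enough to fail it.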
If a file fails on a dense-diagram miscount → apply the escalation rule and re-ingest + re-judge.
soffice is only needed for pptx/xlsx/docx. PDFs need no soffice. If soffice is missing or hangs,
still process PDFs and report which pptx/xlsx/docx were skipped — optionally offload their rendering
to a Linux host (e.g. over ssh) where soffice works.
macOS gotcha: headless soffice --convert-to can hang forever with an empty log even after
xattr -dr com.apple.quarantine /Applications/LibreOffice.app — a Gatekeeper prompt headless can't
dismiss. Workarounds: open LibreOffice.app in the GUI once to approve it, then quit; or render
pptx/xlsx/docx on a Linux host. Always clear quarantine first (necessary, sometimes insufficient).
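Since a headless conversion can hang indefinitely, one defensive pattern is to run it under a hard timeout and treat a timeout like any other failure, so the file is reported as skipped instead of blocking the batch. The sketch below is not taken from the skill's scripts: `soffice_convert_command` and `run_with_timeout` are hypothetical names, and the timeout value is left to the caller.

```python
import subprocess


def soffice_convert_command(src: str, outdir: str) -> list[str]:
    """Build a headless LibreOffice command that converts src to PDF in outdir."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, src]


def run_with_timeout(cmd: list[str], timeout_s: float) -> bool:
    """Return True if cmd exits 0 within timeout_s seconds.

    Returns False if the process hangs past the timeout (subprocess.run kills
    it before raising), exits non-zero, or the binary is not installed.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    except FileNotFoundError:
        return False
    return result.returncode == 0
```

A caller could collect every source for which `run_with_timeout(soffice_convert_command(src, outdir), timeout_s)` returns False and list those files as skipped while continuing with the PDFs.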
Per slide/page, classify_slide_v2 (in convert_pptx_v2.py) routes to the VLM when: a native chart
shape is present; a picture covers ≥ 4% of the slide (or pictures total ≥ 6%) — this catches charts
embedded as images; grouped visual slides; or there are no usable native shapes. Otherwise text/
table-only slides stay native (verbatim, no LLM). Hidden slides are mapped to their rendered pages by
visible-slide order so a deck with hidden slides is not silently dropped to native (a real bug in
the naive approach — see references/validation.md).
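The routing conditions and the hidden-slide mapping can be illustrated with plain data, independent of python-pptx. This is a simplified sketch and not the real classify_slide_v2: `SlideFacts`, `route_slide`, and `visible_slide_pages` are hypothetical names, picture areas are assumed to be precomputed as fractions of the slide area, and the rendered PDF is assumed to contain only the visible slides, in order.

```python
from dataclasses import dataclass, field


@dataclass
class SlideFacts:
    # Hypothetical, precomputed facts about one slide.
    has_native_chart: bool = False
    picture_area_fractions: list[float] = field(default_factory=list)
    has_grouped_visuals: bool = False
    usable_native_shapes: int = 0


def route_slide(s: SlideFacts, single_min: float = 0.04, total_min: float = 0.06) -> str:
    """Return "vlm" if any visual trigger fires, otherwise "native"."""
    if s.has_native_chart:
        return "vlm"
    if any(a >= single_min for a in s.picture_area_fractions):
        return "vlm"  # one picture covers at least 4% of the slide
    if sum(s.picture_area_fractions) >= total_min:
        return "vlm"  # pictures together cover at least 6%
    if s.has_grouped_visuals:
        return "vlm"
    if s.usable_native_shapes == 0:
        return "vlm"  # nothing usable to extract natively
    return "native"


def visible_slide_pages(hidden_flags: list[bool]) -> dict[int, int]:
    """Map 1-based slide numbers of visible slides to 1-based rendered pages."""
    mapping: dict[int, int] = {}
    page = 0
    for slide_no, hidden in enumerate(hidden_flags, start=1):
        if hidden:
            continue  # hidden slides have no rendered page
        page += 1
        mapping[slide_no] = page
    return mapping
```

For a three-slide deck whose second slide is hidden, `visible_slide_pages([False, True, False])` pairs slide 3 with rendered page 2, which is the alignment a naive "slide N equals page N" lookup would get wrong.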
- references/faithfulness_judge.md: the exact Opus-4.8 judge contract + ship-bar.
- references/method_and_gotchas.md: full routing rules, the extraction-prompt faithfulness contract (charts→JSON, tables→HTML, verbatim, [UNCLEAR], count-first, anti-fabrication), OpenRouter/Qwen config, model tiers, and hard-won pitfalls.
- references/validation.md: evidence of how the method was validated 10/10 on real documents, and the bugs the judge caught (hidden-slide drop, small-chart misroute, VLM miscounts, value fabrication).
npx claudepluginhub hungson175/document-processor-skill --plugin document-processor