Skill

document-processor

Ingest business documents (.pptx, .pdf, .xlsx, .docx) into FAITHFUL markdown for a RAG knowledge base or review — VLM-first (Vision-Language Model), NOT OCR. Use this whenever the user wants to convert, ingest, extract, or "turn into markdown/text" any slide deck, report, spreadsheet dashboard, or PDF — ESPECIALLY chart-, dashboard-, or image-heavy and Vietnamese/mixed-language documents where chart numbers, tables, and verbatim text must survive. Trigger on phrases like "convert this deck/report/pptx/pdf to markdown", "ingest these documents", "extract the data/tables/charts from this presentation", "build a knowledge base from these files", or any document→markdown task where faithfulness matters. Prefer this over a plain text-extraction/OCR approach, which silently drops charts and merged-cell layouts. Validated 10/10 on real documents by an independent Opus-4.8 faithfulness judge (see references/validation.md).

Slash command

/document-processor:document-processor

Supporting Files

references/faithfulness_judge.md
references/method_and_gotchas.md
references/validation.md
scripts/common.py
scripts/convert_pdf.py
scripts/convert_pptx.py
scripts/convert_pptx_v2.py
scripts/docproc_env.py
scripts/ingest.py
scripts/judge_summary.py
scripts/prepare_judge.py
scripts/prompts.py
scripts/render_pages.py

SKILL.md

131 lines · ~2.1k tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Jun 18, 2026

document-processor

Convert business documents to faithful markdown with a structure-first + selective Vision-Language-Model (VLM) method. The guiding principle: OCR/plain text extraction looks like it works but silently drops charts, racing/stacked bars, merged-cell "dashboard" xlsx, and mixed-language layout. This skill extracts semantic structure — charts → JSON/markdown, tables → HTML, verbatim text, [UNCLEAR: …] on doubt — and only sends genuinely-visual pages to the VLM, keeping text-dominant pages on a cheap, exact native path.

It also ships a built-in faithfulness self-check: an independent Opus-4.8 sub-agent reads the original page image and the produced markdown and grades fidelity. Because the judge is a different model from the ingester, the cross-check is honest. This is the same procedure that validated the method 10/10.

When to use

Any "document → markdown/text" task where the content matters: slide decks, research/market reports, brand-health studies, UT (user-test) reports, campaign trackers, spreadsheet dashboards, PDFs. Especially chart-/image-heavy or Vietnamese material. If the user just needs a couple of lines of plain text from a simple text PDF, the native path here still handles it cheaply — but reach for this skill the moment fidelity of charts/tables/numbers is in play.

Dependencies

Python 3.12 via uv. Python deps: langchain-openai, langchain-core, python-pptx, python-docx, pypdfium2, pillow, python-dotenv, httpx, openpyxl.
soffice (LibreOffice) — converts pptx/xlsx/docx → pdf for rendering. PDFs need NO soffice (pypdfium2 renders them directly).
poppler (pdftotext, pdfinfo) — native PDF text + page counts.
OPENROUTER_API_KEY in the environment or ~/dev/.env (read automatically). The VLM is Qwen3-VL via OpenRouter. The Opus-4.8 verifier runs on the Claude subscription (no API key).

Install deps once, e.g.: uv venv && uv pip install langchain-openai langchain-core python-pptx python-docx pypdfium2 pillow python-dotenv httpx openpyxl. Run scripts with that interpreter.
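
The key lookup order described above (environment first, then ~/dev/.env) can be sketched with the standard library alone. This is an illustration, not the skill's own loader: python-dotenv is among the listed dependencies, so the real code may differ. The function name find_openrouter_key and the simple KEY=VALUE parsing are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_openrouter_key(
    env: Mapping[str, str] = os.environ,
    dotenv_path: Path = Path.home() / "dev" / ".env",
) -> Optional[str]:
    """Return OPENROUTER_API_KEY from env, else from a KEY=VALUE .env file.

    Hypothetical helper: the name and the parsing rules are assumptions.
    """
    key = env.get("OPENROUTER_API_KEY")
    if key:
        return key
    if not dotenv_path.is_file():
        return None
    for raw in dotenv_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Tolerate an optional shell-style "export " prefix.
        if line.startswith("export "):
            line = line[len("export "):]
        name, _, value = line.partition("=")
        if name.strip() == "OPENROUTER_API_KEY":
            # Drop surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None
```

Failing fast when this returns None, before any page is rendered, avoids discovering a missing key halfway through a batch.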

Quick start — ingest

# one file or a whole directory -> faithful markdown in <out>/<stem>.md
python scripts/ingest.py --in /path/to/file_or_dir --out /path/to/out

Routing is automatic by extension:

pptx → structure-first (convert_pptx_v2.py): text/table slides render natively (no LLM, verbatim); chart/picture/visual slides → Qwen3-VL.
pdf → structure-first (convert_pdf.py): pdftotext per page; low-text pages → Qwen3-VL.
xlsx / docx → rendered page-by-page → Qwen3-VL (these are visual/layout documents — xlsx in this corpus are merged-cell dashboards, not dataframes).
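
As a rough sketch of the routing list above, dispatch by extension can be written as a lookup table. The route labels, the function names, and in particular the min_chars threshold for a "low-text" PDF page are assumptions for illustration; the actual rules live in the scripts and in references/method_and_gotchas.md.

```python
from pathlib import Path

# Extension -> route label, following the routing list above.
# The label strings are illustrative, not values used by ingest.py.
ROUTES = {
    ".pptx": "structure-first (pptx)",
    ".pdf": "structure-first (pdf)",
    ".xlsx": "render + VLM",
    ".docx": "render + VLM",
}


def route_for(path: str) -> str:
    """Pick a route from the file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    try:
        return ROUTES[ext]
    except KeyError:
        raise ValueError(f"unsupported document type: {ext or '(none)'}") from None


def pdf_page_needs_vlm(page_text: str, min_chars: int = 200) -> bool:
    """Hypothetical low-text test for one page of pdftotext output.

    The threshold of 200 non-whitespace characters is an assumption,
    not the value used by convert_pdf.py.
    """
    return len("".join(page_text.split())) < min_chars
```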

Output carries per-unit markers (parse by prefix only; the suffix varies). Per-file cost/routing is logged to <out>/_ingest_log.jsonl.

The escalation rule (the key learned lever)

The default model is Qwen3-VL-32B (DOC_VLM_MODEL=medium) — fast, cheap, and faithful on charts, tables, numbers, and text. But it under-enumerates dense same-shape cluster visuals (honeycomb diagrams, packed icon grids): in validation it dropped 2 of 14 hexagon labels at both 2× and 3× DPI — a model-capacity limit, not a resolution one. The 235B flagship captured all 14.

Rule: when a document is heavy in dense cluster/diagram visuals, or the built-in verifier flags a miscount / dropped diagram element, re-run with DOC_VLM_MODEL=flagship. Do NOT just raise the DPI — that does not fix enumeration. Escalating the model does.

DOC_VLM_MODEL=flagship python scripts/ingest.py --in <file> --out <out>   # for the hard ones
DOC_VLM_SCALE=3.0       python scripts/ingest.py --in <file> --out <out>   # optional: denser charts
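
A minimal sketch of the escalation rule as code, assuming a reviewer (or the judge) has listed the labels visible on the page. The helper names are hypothetical; only DOC_VLM_MODEL=flagship and the scripts/ingest.py command line come from the text above.

```python
import os
from typing import Dict, List, Mapping, Sequence


def missing_labels(expected: Sequence[str], extracted: Sequence[str]) -> List[str]:
    """Labels counted on the page image that the markdown does not contain."""
    seen = {label.strip().lower() for label in extracted}
    return [label for label in expected if label.strip().lower() not in seen]


def escalated_env(base: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Copy of the environment with the model tier switched to flagship."""
    env = dict(base)
    env["DOC_VLM_MODEL"] = "flagship"
    # DOC_VLM_SCALE is left untouched: raising DPI does not fix enumeration.
    return env


def reingest_command(src: str, out: str) -> List[str]:
    """The quick-start command line, to be run with env=escalated_env()."""
    return ["python", "scripts/ingest.py", "--in", src, "--out", out]
```

If missing_labels returns anything for a dense cluster visual, re-run with subprocess.run(reingest_command(src, out), env=escalated_env(), check=True) and judge the new output again.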

Verify faithfulness (built-in self-check — do this, don't trust the ingester blindly)

The ingester is a VLM; verify its output with an independent Opus-4.8 judge that compares the produced markdown against the original page image (ground truth). Numbers are zero-tolerance.

Render representative original pages to PNG (same pixels the VLM saw; also writes a visible-slide map so hidden-slide decks align):
```
python scripts/render_pages.py <src-file> <verify_dir>/<stem>
```
Build a judge packet — selects representative units (VLM/chart pages prioritized; broad spread on fully-native decks to catch a chart misrouted to native and dropped):
```
python scripts/prepare_judge.py --md <out>/<stem>.md \
  --render <verify_dir>/<stem>/pngs --out <verify_dir>/<stem>
```
Judge with an independent Opus-4.8 sub-agent. Spawn a sub-agent (Agent tool, model: opus) and give it the manifest at <verify_dir>/<stem>/manifest.json. Use the exact contract in references/faithfulness_judge.md (image = ground truth; chart/table/research numbers zero-tolerance; fabrication = fail; imprecise transcription of tiny printed fine print = minor; zoom to verify before calling it fabrication). It writes <verify_dir>/<stem>.verdict.json.
Aggregate the ship-bar:
```
python scripts/judge_summary.py --dir <verify_dir>
```
A file passes if and only if its pass-rate is ≥ 0.90, there is no number error on research data, and there is no hard fail. If a file fails on a dense-diagram miscount, apply the escalation rule, then re-ingest and re-judge.
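
The ship-bar is a conjunction of three conditions, which a short predicate makes explicit. The FileVerdict fields below are hypothetical; the real schema of the verdict JSON is defined in references/faithfulness_judge.md and may use different names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileVerdict:
    """Hypothetical per-file tallies; not the actual verdict.json schema."""
    units_judged: int
    units_passed: int
    research_number_errors: int
    hard_fails: int


def passes_ship_bar(v: FileVerdict, min_pass_rate: float = 0.90) -> bool:
    """True only if all three ship-bar conditions hold."""
    if v.units_judged == 0:
        return False  # nothing judged means nothing verified
    pass_rate = v.units_passed / v.units_judged
    return (
        pass_rate >= min_pass_rate
        and v.research_number_errors == 0
        and v.hard_fails == 0
    )
```

Note that a perfect pass-rate does not rescue a file: one number error on research data, or one hard fail, is enough to block it.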

soffice availability — degrade gracefully, never hard-fail

soffice is only needed for pptx/xlsx/docx. PDFs need no soffice. If soffice is missing or hangs, still process PDFs and report which pptx/xlsx/docx were skipped — optionally offload their rendering to a Linux host (e.g. over ssh) where soffice works.
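
One way to degrade gracefully is to split the batch up front, so PDFs proceed and office files are reported as skipped. This is a sketch under the assumption that soffice is found on PATH; plan_batch and NEEDS_SOFFICE are illustrative names, not part of the skill's scripts.

```python
import shutil
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

# Formats that must be converted to PDF by LibreOffice before rendering.
NEEDS_SOFFICE = {".pptx", ".xlsx", ".docx"}


def plan_batch(
    paths: Iterable[str], have_soffice: Optional[bool] = None
) -> Tuple[List[str], List[str]]:
    """Split inputs into (processable, skipped) given soffice availability.

    When have_soffice is None, availability is detected by looking for
    a "soffice" executable on PATH.
    """
    if have_soffice is None:
        have_soffice = shutil.which("soffice") is not None
    processable: List[str] = []
    skipped: List[str] = []
    for p in paths:
        needs = Path(p).suffix.lower() in NEEDS_SOFFICE
        if needs and not have_soffice:
            skipped.append(p)
        else:
            processable.append(p)
    return processable, skipped
```

The skipped list is what gets reported back to the user, or handed to a Linux host for rendering.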

macOS gotcha: headless soffice --convert-to can hang forever with an empty log, even after xattr -dr com.apple.quarantine /Applications/LibreOffice.app, because of a Gatekeeper prompt that a headless process cannot dismiss. Workarounds: open LibreOffice.app in the GUI once to approve it, then quit; or render pptx/xlsx/docx on a Linux host. Always clear quarantine first (necessary, but sometimes insufficient).
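
Because the hang produces no log output, a timeout is the practical way to detect it. The sketch below wraps the conversion in subprocess.run with a timeout and treats a hang, a missing binary, or a non-zero exit as "skip this file". The helper names are assumptions, and any timeout value is a guess to tune per machine.

```python
import subprocess
from typing import List, Sequence


def soffice_pdf_command(src: str, outdir: str) -> List[str]:
    """Headless LibreOffice conversion of one office file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, src]


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run cmd; return False on timeout, missing binary, or non-zero exit."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child on expiry
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return done.returncode == 0
```

For example, run_with_timeout(soffice_pdf_command("deck.pptx", "pdf_out"), timeout_s=120) returning False would add deck.pptx to the skipped report instead of blocking the batch.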

How routing decides (so you can reason about edge cases)

Per slide/page, classify_slide_v2 (in convert_pptx_v2.py) routes to the VLM when any of the following holds: a native chart shape is present; a single picture covers ≥ 4% of the slide, or all pictures together cover ≥ 6% (this catches charts embedded as images); the slide is a grouped visual; or there are no usable native shapes. Otherwise, text/table-only slides stay native (verbatim, no LLM). Hidden slides are mapped to their rendered pages by visible-slide order, so a deck with hidden slides is not silently dropped to native (a real bug in the naive approach; see references/validation.md).
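
To reason about edge cases, the decision above can be modelled as a small pure function. This is not the code of classify_slide_v2: SlideFacts, route_slide, and visible_slide_to_page are hypothetical names, and only the 4% and 6% thresholds come from the text. The page mapping assumes the rendered PDF omits hidden slides, so pages follow visible-slide order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class SlideFacts:
    """Hypothetical per-slide summary; not the skill's real data model."""
    has_native_chart: bool = False
    picture_area_fractions: List[float] = field(default_factory=list)
    is_grouped_visual: bool = False
    has_usable_native_shapes: bool = True


def route_slide(s: SlideFacts, single_min: float = 0.04, total_min: float = 0.06) -> str:
    """Return "vlm" or "native" following the four conditions above."""
    if s.has_native_chart:
        return "vlm"
    # One large picture, or several small ones adding up, may hide a chart.
    if any(f >= single_min for f in s.picture_area_fractions):
        return "vlm"
    if sum(s.picture_area_fractions) >= total_min:
        return "vlm"
    if s.is_grouped_visual or not s.has_usable_native_shapes:
        return "vlm"
    return "native"


def visible_slide_to_page(hidden_flags: Sequence[bool]) -> Dict[int, int]:
    """Map 1-based slide index to 1-based rendered page, skipping hidden slides."""
    mapping: Dict[int, int] = {}
    page = 0
    for index, hidden in enumerate(hidden_flags, start=1):
        if not hidden:
            page += 1
            mapping[index] = page
    return mapping
```

With slide 2 hidden in a three-slide deck, slide 3 maps to rendered page 2. Indexing rendered pages by raw slide number instead would pair slide 3 with a page that does not exist, which is how the naive approach loses slides.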

More detail

references/faithfulness_judge.md — the exact Opus-4.8 judge contract + ship-bar.
references/method_and_gotchas.md — full routing rules, the extraction-prompt faithfulness contract (charts→JSON, tables→HTML, verbatim, [UNCLEAR], count-first, anti-fabrication), OpenRouter/Qwen config, model tiers, and hard-won pitfalls.
references/validation.md — evidence: how the method was validated 10/10 on real documents and the bugs the judge caught (hidden-slide drop, small-chart misroute, VLM miscounts, value fabrication).
