From crucible
Converts heavy document formats (PDF, Word, Excel, PowerPoint, and others) to token-efficient Markdown/CSV with structurally aware digest compression. Use when Claude needs to read documents without spending excessive context budget.
Slash command: /crucible:distill
<!-- CANONICAL: shared/dispatch-convention.md -->
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
Convert heavy document formats to token-efficient representations (Markdown, CSV) for LLM consumption. The core deliverable is the .digest.md file: a structurally aware compression at 20-30% of the converted file's token count.
Skill type: Rigid — follow exactly, no shortcuts.
Models:
Announce at start: "I'm using the distill skill to convert documents to token-efficient formats."
/distill <path> [path2 ...]
/distill <directory>
Examples:
- /distill docs/report.pdf (convert one file)
- /distill docs/report.pdf data/sheet.xlsx slides/deck.pptx (convert multiple files)
- /distill docs/ (convert all supported files in directory; single-level, not recursive)

Mixed mode is supported: /distill docs/ extra/report.pdf
Execute phases in this order. Each phase completes for all files before the next begins.
At skill start, before processing any files, check for required tools:
| Check | Command | If Missing |
|---|---|---|
| Tier 1 | which pandoc | "pandoc not found. Install: apt install pandoc (Debian/Ubuntu) or brew install pandoc (macOS). Tier 1 formats will be skipped." |
| Tier 2 | which pdftotext | "pdftotext not found. Install: apt install poppler-utils (Debian/Ubuntu) or brew install poppler (macOS). PDF conversion will be skipped." |
| Tier 3 | which python3 | "python3 not found. PPTX and XLSX conversion will be skipped." |
| Pre-flight | which unzip | Skip zip bomb detection with note. Not a conversion blocker. |
| Pre-flight | which pdfdetach | Skip PDF attachment detection with note. Not a conversion blocker. |
Build a set of available tiers. Route files only to available tiers. Files targeting unavailable tiers get routed to unsupported-with-guidance (Phase 1b).
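The tier set described above could be built along these lines. This is a minimal sketch, not the skill's actual implementation: the function names `have_tool`, `detect_tiers`, and `tier_available` are hypothetical, and it uses POSIX `command -v` where the table uses `which`.

```shell
# Succeeds if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Print the available tiers as a space-separated list, e.g. "1 3".
# (Hypothetical helper; tier numbers follow the dependency table above.)
detect_tiers() {
  tiers=""
  if have_tool pandoc;    then tiers="$tiers 1"; fi
  if have_tool pdftotext; then tiers="$tiers 2"; fi
  if have_tool python3;   then tiers="$tiers 3"; fi
  # Strip the leading space before printing.
  printf '%s\n' "${tiers# }"
}

# Succeeds if tier $1 appears in the space-separated list $2.
tier_available() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

A caller might run `TIERS=$(detect_tiers)` once at startup and then test `tier_available 2 "$TIERS"` before routing a PDF.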
Individual file paths: Use directly. Verify each file exists.
Directory paths: Single-level glob for files with supported extensions (not recursive). Build file list sorted alphabetically. Report: "Found {N} convertible files in {directory}: {list}."
Supported extensions for glob: .pdf, .docx, .rtf, .html, .htm, .odt, .epub, .rst, .org, .tex, .ipynb, .pptx, .xlsx
Mixed mode: Process both directory globs and individual paths. Deduplicate by absolute path.
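One way to sketch input resolution with deduplication by absolute path is shown below. The helper names `abs_path` and `collect_inputs` are assumptions for illustration, and the sketch assumes file names contain no newlines.

```shell
# Print the absolute path of an existing file. Directory symlinks are
# resolved via `pwd -P`; the file name itself is kept as given.
abs_path() (
  cd "$(dirname "$1")" && printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")"
)

# Expand arguments: directories get a single-level scan for supported
# extensions, individual files are used directly. Output is sorted and
# deduplicated by absolute path.
collect_inputs() {
  for arg in "$@"; do
    if [ -d "$arg" ]; then
      for f in "$arg"/*; do
        [ -f "$f" ] || continue   # skips subdirectories: not recursive
        case "$f" in
          *.pdf|*.docx|*.rtf|*.html|*.htm|*.odt|*.epub|*.rst|*.org|*.tex|*.ipynb|*.pptx|*.xlsx)
            abs_path "$f" ;;
        esac
      done
    elif [ -f "$arg" ]; then
      abs_path "$arg"
    else
      printf 'File not found: %s\n' "$arg" >&2
    fi
  done | sort -u
}
```

With this sketch, `collect_inputs docs/ docs/report.pdf` lists `report.pdf` only once, because both arguments resolve to the same absolute path.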
For each file, determine the conversion tier by extension:
| Extension | Tier | Format Flag |
|---|---|---|
| .docx | 1 | docx |
| .rtf | 1 | rtf |
| .html | 1 | html |
| .htm | 1 | html |
| .odt | 1 | odt |
| .epub | 1 | epub |
| .rst | 1 | rst |
| .org | 1 | org |
| .tex | 1 | latex |
| .ipynb | 1 | ipynb |
| .pdf | 2 | n/a |
| .pptx | 3 | n/a |
| .xlsx | 3 | n/a |
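The routing table maps naturally onto a `case` statement. The sketch below uses a hypothetical function name, `route_file`, and matches extensions exactly as the table lists them (lowercase); printing `-` for a missing format flag is an illustrative choice.

```shell
# Print "<tier> <format-flag>" for a path, or "unsupported" when the
# extension is not in the routing table. Tiers 2 and 3 take no pandoc
# format flag, shown here as "-".
route_file() {
  case "$1" in
    *.docx)        echo "1 docx" ;;
    *.rtf)         echo "1 rtf" ;;
    *.html|*.htm)  echo "1 html" ;;
    *.odt)         echo "1 odt" ;;
    *.epub)        echo "1 epub" ;;
    *.rst)         echo "1 rst" ;;
    *.org)         echo "1 org" ;;
    *.tex)         echo "1 latex" ;;
    *.ipynb)       echo "1 ipynb" ;;
    *.pdf)         echo "2 -" ;;
    *.pptx|*.xlsx) echo "3 -" ;;
    *)             echo "unsupported" ;;
  esac
}
```

For example, `route_file paper.tex` prints `1 latex`, which supplies both the tier and the value to pass to `pandoc -f`.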
Unsupported formats: Output actionable guidance per this table, then continue with remaining files:
| Extension | Guidance |
|---|---|
| .xls | "Legacy Excel format. Export as .xlsx from Excel/LibreOffice, then re-run /distill." |
| .ods | "OpenDocument Spreadsheet. Export as .csv (single-sheet) or .xlsx (multi-sheet), then re-run /distill." |
| .odp | "OpenDocument Presentation. Export as .pptx, then re-run /distill." |
| .key | "Apple Keynote. Export as .pptx from Keynote, then re-run /distill." |
| .numbers | "Apple Numbers. Export as .xlsx from Numbers, then re-run /distill." |
| .pages | "Apple Pages. Export as .docx from Pages, then re-run /distill." |
Unknown extensions: "Unsupported format: {ext}. Supported formats: docx, rtf, html, odt, epub, rst, org, tex, ipynb, pdf, pptx, xlsx."
Unavailable tier: If a file's tier is unavailable (tool missing from Phase 0), report: "{file}: requires {tool} (not installed). Skipping."
Run per-file safety checks before conversion. Failures are per-file — do not halt the batch.
Office formats are ZIP archives. If unzip is available:
UNCOMPRESSED=$(unzip -l "$INPUT_PATH" 2>/dev/null | tail -1 | awk '{print $1}')
If uncompressed size exceeds 500MB (524288000 bytes), abort this file: "File uncompressed size ({size}) exceeds 500MB safety limit. Skipping."
If unzip is not available, skip this check (noted in Phase 0).
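The size comparison itself can be sketched as a small helper. The name `over_zip_limit` is hypothetical, and treating an empty or non-numeric size as "unknown, do not block" is an assumption, since the source does not say how unparseable `unzip -l` output should be handled.

```shell
ZIP_LIMIT_BYTES=524288000   # 500 MB (500 * 1024 * 1024), as stated above

# Succeeds when the given uncompressed byte count exceeds the limit.
# Empty or non-numeric input is treated as unknown and does not trip
# the limit (assumption, not specified by the skill).
over_zip_limit() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt "$ZIP_LIMIT_BYTES" ]
}
```

Typical use would be `if over_zip_limit "$UNCOMPRESSED"; then ...skip the file...; fi` right after the `unzip -l` line.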
For PDF files, if pdfdetach is available:
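# Count only the numbered entries ("1: name"); the summary line ("N embedded files") also starts with a digit
ATTACHMENTS=$(pdfdetach -list "$INPUT_PATH" 2>/dev/null | grep -cE '^[0-9]+:')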
If attachments found, warn: "PDF contains {N} embedded attachments. These are not extracted — only text content is converted." Continue with conversion.
After conversion (not before), verify output is valid UTF-8:
file --mime-encoding "$OUTPUT_PATH"
If not UTF-8, attempt re-encoding: iconv -f <detected-charset> -t UTF-8 "$OUTPUT_PATH" -o "$OUTPUT_PATH.tmp" && mv "$OUTPUT_PATH.tmp" "$OUTPUT_PATH". If re-encoding fails, report and skip.
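A hedged sketch of the post-conversion encoding check follows. The function names are hypothetical. It relies on `file -b --mime-encoding`, which prints only the charset name, and treats `us-ascii` as acceptable because ASCII is a subset of UTF-8.

```shell
# Succeeds when a charset name reported by `file` needs no re-encoding.
is_utf8_compatible() {
  case "$1" in
    utf-8|us-ascii) return 0 ;;
    *)              return 1 ;;
  esac
}

# Detect the charset of $1 and re-encode in place if needed.
# Returns non-zero when detection or re-encoding fails, so the caller
# can report and skip the file.
ensure_utf8() {
  charset=$(file -b --mime-encoding "$1") || return 1
  if is_utf8_compatible "$charset"; then
    return 0
  fi
  if iconv -f "$charset" -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"; then
    return 0
  fi
  rm -f "$1.tmp"   # clean up the partial output on failure
  return 1
}
```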
Process files sequentially. For each file:
INPUT_PATH="$1"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
FORMAT="$2" # from routing table
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
Shell safety: All file paths via quoted shell variables. Never inline interpolation. Never use unquoted $() or backtick interpolation of file paths.
Error handling:
Idempotency: Overwrites existing output files without warning.
Step 1 — Extract:
INPUT_PATH="$1"
TEXT_PATH="${INPUT_PATH%.*}.txt"
OUTPUT_PATH="${INPUT_PATH%.*}.md"
pdftotext -layout "$INPUT_PATH" "$TEXT_PATH"
Scanned PDF detection: Count total characters and pages:
CHARS=$(wc -c < "$TEXT_PATH")
PAGES=$(pdfinfo "$INPUT_PATH" 2>/dev/null | grep "^Pages:" | awk '{print $2}')
If pdfinfo is unavailable, estimate pages from pdftotext output (count form-feed characters). If average chars/page < 50, report: "This PDF appears to be scanned/image-based. Text extraction produced minimal content. Consider OCR processing externally before distilling." Skip structuring pass. Clean up temp .txt file.
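The scanned-PDF heuristic reduces to an integer division. In the sketch below the function names are hypothetical, and falling back to one page when the page count is missing or zero is an assumption. The form-feed count assumes pdftotext's default behaviour of emitting a form feed at page breaks.

```shell
# Estimate the page count of pdftotext output by counting form feeds.
count_pages_from_text() {
  tr -cd '\f' < "$1" | wc -c | tr -d ' '
}

# Succeeds when the average characters per page is below 50.
looks_scanned() {
  chars=$1
  pages=$2
  # Guard against a missing or zero page count (assumption: use 1 page).
  [ "${pages:-0}" -gt 0 ] 2>/dev/null || pages=1
  [ $((chars / pages)) -lt 50 ]
}
```

For instance, 120 characters over 3 pages averages 40 per page and is flagged, while 5000 characters over 10 pages is not.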
Step 2 — Structure: Dispatch a Sonnet agent using skills/distill/pdf-structurer-prompt.md to transform the raw pdftotext output into clean Markdown with recovered headings, lists, tables, and code blocks. Write result to OUTPUT_PATH. Clean up temp .txt file.
Venv setup (once per invocation, only if Tier 3 files exist):
VENV="/tmp/crucible-distill-venv"
# Health check: rebuild the venv if it is broken or missing a dependency
if [ -d "$VENV" ]; then
  "$VENV/bin/python3" -c "import pptx, openpyxl" 2>/dev/null || rm -rf "$VENV"
fi
# Create if missing
if [ ! -d "$VENV" ]; then
  echo "Installing Python dependencies (one-time setup, ~15 seconds)..."
  python3 -m venv "$VENV"
  if ! "$VENV/bin/pip" install --quiet python-pptx==1.0.2 openpyxl==3.1.5; then
    echo "Failed to install Python dependencies."
    echo "Manual install: pip install python-pptx==1.0.2 openpyxl==3.1.5"
    echo "PPTX and XLSX conversion will be skipped."
    # Remove the partial venv so the next run retries the install
    rm -rf "$VENV"
    # Route remaining Tier 3 files to unsupported
    return 1  # assumes this snippet runs inside a function or a sourced script
  fi
fi
PPTX conversion:
"$VENV/bin/python3" skills/distill/convert_pptx.py --input "$INPUT_PATH" --output "$OUTPUT_PATH"
XLSX conversion:
"$VENV/bin/python3" skills/distill/convert_xlsx.py --input "$INPUT_PATH" --output-dir "$(dirname "$INPUT_PATH")"
Output: one CSV per sheet at {basename}-{sheetname}.csv. Sheet names are sanitized (spaces → hyphens, special characters stripped).
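The naming rule can be illustrated in shell, although the real conversion happens in convert_xlsx.py. This is only a sketch: the helper names are hypothetical, and which characters count as "special" is an assumption (here, anything other than ASCII letters, digits, underscore, and hyphen is stripped).

```shell
# Replace spaces with hyphens, then drop every character outside
# A-Z, a-z, 0-9, underscore, and hyphen (assumed definition of "special").
sanitize_sheetname() {
  printf '%s' "$1" | tr ' ' '-' | tr -cd 'A-Za-z0-9_-'
}

# Build the CSV file name for workbook path $1 and sheet name $2.
csv_name() {
  base=$(basename "$1")
  printf '%s-%s.csv\n' "${base%.*}" "$(sanitize_sheetname "$2")"
}
```

Under these assumptions, a sheet named "Q1 Sales (EU)" in data/sheet.xlsx would be written as sheet-Q1-Sales-EU.csv.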
After all conversions complete, run the digest pass on eligible files.
Eligibility:
- .md (not .csv)

Dispatch: For each eligible file, dispatch a Sonnet digest agent using skills/distill/digest-prompt.md. Before dispatching, fill template placeholders: replace {{ORIGINAL_WORDS}} with the converted file's word count and {{TARGET_WORDS}} with 25% of that count. The raw pdftotext output (for pdf-structurer-prompt.md) or converted .md content (for digest-prompt.md) is included as a content block below the prompt template in the dispatch file.
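Filling the two placeholders might look like the sketch below. The function name `build_digest_dispatch` is hypothetical, the exact layout of the dispatch file is governed by shared/dispatch-convention.md rather than by this example, and the 25% target uses integer arithmetic (rounded down).

```shell
# Fill {{ORIGINAL_WORDS}} and {{TARGET_WORDS}} in a prompt template and
# append the converted content below it. Result goes to stdout.
# $1 = template path, $2 = converted .md path
build_digest_dispatch() {
  words=$(( $(wc -w < "$2") ))      # arithmetic expansion strips wc padding
  target=$(( words * 25 / 100 ))    # 25% of the converted word count
  sed -e "s/{{ORIGINAL_WORDS}}/$words/g" \
      -e "s/{{TARGET_WORDS}}/$target/g" "$1"
  printf '\n'
  cat "$2"
}
```

A caller would redirect the output into the dispatch file, with both paths passed as quoted variables.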
Quality check: After the digest agent returns, count words in the digest:
Output: Write digest to {original-path-without-ext}.digest.md.
Word count is a proxy for token count. These diverge for code-heavy or CJK content, but word count is sufficient for v1.
After all conversions and digests complete, output:
## Distill Summary
| File | Format | Tier | Converted | Digest | Token Savings |
|---|---|---|---|---|---|
| {file} | {format} | {tier} | {output} ({words} words) | {digest} ({words} words) | ~{pct}% |
**Total:** {N} files converted, {M} digests produced, ~{pct}% average token savings on digestible content.
Generated files can be added to .gitignore if not needed in version control.
Token savings per file = 1 - (digest words / converted words) expressed as percentage.
Files that were skipped (unsupported, tool missing, pre-flight failure) are listed separately:
**Skipped:** {N} files
- {file}: {reason}
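The per-file savings formula above, 1 - (digest words / converted words), can be computed with integer arithmetic. The function name `savings_pct` is hypothetical, and returning 0 for an empty converted file is an assumption made to avoid division by zero.

```shell
# Print the approximate token savings as an integer percentage.
# $1 = digest word count, $2 = converted word count
savings_pct() {
  digest=$1
  converted=$2
  if [ "$converted" -le 0 ]; then
    echo 0        # assumption: no savings reported for empty input
    return 0
  fi
  echo $(( 100 - (digest * 100) / converted ))
}
```

A 250-word digest of a 1000-word conversion gives 75, which would appear in the summary as ~75%.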
Every Bash command that touches file paths MUST use quoted shell variables:
# CORRECT
pandoc -f "$FORMAT" -t markdown --wrap=none "$INPUT_PATH" -o "$OUTPUT_PATH"
# WRONG — never do this
pandoc -f $FORMAT -t markdown --wrap=none $INPUT_PATH -o $OUTPUT_PATH
- Always "$VAR", never bare $VAR
- Never use unquoted $() or backtick interpolation of paths

| Failure | Behavior |
|---|---|
| Tool not installed | Skip tier, report with install guidance, continue |
| Conversion fails (non-zero exit) | Report per-file, continue with remaining files |
| Empty conversion output | Report per-file, continue |
| Zip bomb detected | Skip file, report, continue |
| Scanned PDF | Report, skip digest, continue |
| Venv/pip failure | Skip Tier 3, report with manual install instructions |
| Digest out of range | One retry, accept second result regardless |
| File not found | Report, continue with remaining files |
| Permission denied | Report, continue |
| Encoding error | Attempt re-encode, skip on failure, continue |
Principle: Never halt the batch for a single file failure. Report and continue.
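The report-and-continue principle can be sketched as a loop that records failures instead of exiting. Both `run_batch` and the stand-in handler `demo_convert` are hypothetical names; the real per-file step would be whichever tier conversion applies.

```shell
# Stand-in per-file handler (hypothetical): succeeds if the file is readable.
demo_convert() {
  [ -r "$1" ]
}

# Run handler $1 over the remaining arguments. A failing file is recorded
# and the loop moves on; nothing aborts the batch.
run_batch() {
  handler=$1
  shift
  converted=0
  skipped=""
  for f in "$@"; do
    if "$handler" "$f"; then
      converted=$((converted + 1))
    else
      skipped="$skipped
- $f: conversion failed"
    fi
  done
  printf 'Converted: %d\n' "$converted"
  if [ -n "$skipped" ]; then
    printf 'Skipped:%s\n' "$skipped"
  fi
}
```

Invoked as `run_batch demo_convert a.pdf missing.docx c.pptx`, the loop would still attempt `c.pptx` after `missing.docx` fails.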
Standalone usage:
- /distill <path> (convert one or more files)
- /distill <directory> (convert all supported files in directory)

Called by:
Dispatches:
- skills/distill/pdf-structurer-prompt.md
- skills/distill/digest-prompt.md

Does not dispatch: No quality gate, no red-team, no review loop. Distill is a utility skill: it converts and compresses. Quality is ensured by the digest quality metric (word count check + one retry).
npx claudepluginhub raddue/crucible