doc-to-notes
Convert .docx / .doc / .pdf training or learning documents into structured, up-to-date Obsidian Markdown notes. Scripts parse headings/code/lists/tables/images and split the doc into per-chapter JSON; images upload to Aliyun OSS; oversized screenshots are auto-resized, OCR'd (Apple Vision) and visually analyzed (architecture→Mermaid, code screenshot→code block, data screenshot→table); content is re-baselined to the latest stable version against official docs (concepts/API/config/terminology taught in the new version's voice, old version kept only as migration notes); output is split into one Markdown file per chapter. Big-data tech (Flink/Hadoop/Spark/Kafka) routes to 214_Big_Data. Use when user provides a .docx/.doc/.pdf path to turn into knowledge base notes, mentions 资料转换 / 培训文档整理 / 学习笔记, or processes training materials (e.g. 多易大数据, Flink/Spark/Kafka internal docs).
Convert a large .docx / .doc / .pdf learning document into multiple structured
Obsidian Markdown notes — one per chapter — with code blocks, Mermaid diagrams,
OSS-hosted images, and content refreshed to the latest official version.
The Cloudflare timeout is 120s. A single large Write/Edit will time out and corrupt state. Therefore:
- Read one chapter file at a time (chapter_NN.json), never the whole manifest.
- Fill one ### section per Edit.
- Never batch multiple chapters, and never write a full chapter body in one call.
pip install python-docx oss2 pymupdf pillow ocrmac
sips (macOS built-in)
source ~/.zprofile && python3 __SKILL_DIR__/scripts/extract_docx.py \
"/path/to/document.docx"
Handles .docx, .doc (auto-converts via textutil), and .pdf. Outputs to
/tmp/doc_notes_<name>/:
- manifest.json: full structure (headings, code, lists, tables, images w/ dimensions)
- chapter_NN.json: one file per chapter (auto-split at H2, or H3 if chapters are huge)
- images/: extracted images, resized to ≤2000px

Read the printed summary: title, chapter list, section-type counts. Note the chapter count, since you will process exactly that many files.
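As a rough sketch of reading the extraction output, the snippet below counts section types per chapter file. The chapter schema is not documented here, so the `sections` list and its `type` field are assumptions; adjust them to whatever extract_docx.py actually writes.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_chapters(out_dir):
    """Return {chapter filename: Counter of section types} for chapter_NN.json files."""
    summary = {}
    for path in sorted(Path(out_dir).glob("chapter_*.json")):
        chapter = json.loads(path.read_text(encoding="utf-8"))
        # ASSUMPTION: each chapter file holds a "sections" list of {"type": ...} dicts.
        sections = chapter.get("sections", [])
        summary[path.name] = Counter(s.get("type", "unknown") for s in sections)
    return summary
```

Only one chapter file is loaded at a time, and manifest.json is never opened, which matches the one-chapter-per-read rule.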
| Doc topic | Target |
|---|---|
| Big data (Flink, Hadoop, Spark, Kafka, Hive, HBase, Zookeeper) | 210_Dev_Stack/214_Big_Data/<Tech>/ |
| Middleware (Redis, RabbitMQ, Nginx, Dubbo, ES) | 210_Dev_Stack/213_Middleware/<Tech>/ |
| Language-specific (Java, Python) | 210_Dev_Stack/212_Java_Expert/ or 211_Python_Expert/ |
| Infra / ops (K8s, Docker, Linux) | 230_Infra_Ops/ |
| Platform course series (GeekTime etc.) | 260_Courses/ |
| Ambiguous | ask the user |
Big data lives in its own 214_Big_Data/, a sibling of 213_Middleware/, not inside Middleware. Each technology gets its own sub-dir: 214_Big_Data/Flink/, 214_Big_Data/Hadoop/, 214_Big_Data/Spark/, … New tech dirs are created as needed.
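The routing table can be sketched as a simple lookup. This is only an illustration covering the Big Data and Middleware rows; the function name `target_dir` and the lowercase matching are assumptions, not part of the skill's scripts.

```python
# Routing table from the section above; keys are lowercase technology names.
ROUTES = {
    "214_Big_Data": {"flink", "hadoop", "spark", "kafka", "hive", "hbase", "zookeeper"},
    "213_Middleware": {"redis", "rabbitmq", "nginx", "dubbo", "es"},
}


def target_dir(tech):
    """Return the target dir for a technology, or None if it is ambiguous."""
    key = tech.strip().lower()
    for parent, names in ROUTES.items():
        if key in names:
            # Each technology gets its own sub-dir, spelled as the caller wrote it.
            return f"210_Dev_Stack/{parent}/{tech.strip()}/"
    return None  # ambiguous: ask the user


print(target_dir("Flink"))  # 210_Dev_Stack/214_Big_Data/Flink/
```

A `None` result corresponds to the "Ambiguous" row, where the user is asked instead of guessing.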
Output structure — one sub-dir per source doc, one .md per top-level chapter, plus
an index. A typical doc yields just a few files:
214_Big_Data/Flink/01_Flink基础/
├── 00-索引.md # MOC: links every chapter in order
├── 01-快速认识Flink.md # = H2 "1.快速认识flink"
├── 02-环境准备与编程入门.md # = H2 "2.Flink环境准备和编程入门"
└── 03-DataStream编程基础.md # = H2 "3.DataStream编程基础" (large → fill ### by ###)
source ~/.zprofile && python3 __SKILL_DIR__/scripts/upload_oss.py \
/tmp/doc_notes_<name>/images/
Outputs url_mapping.json → {filename: oss_url}. Keyed by content md5 (idempotent —
safe to re-run, no duplicate uploads).
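To illustrate why keying by content md5 makes re-runs safe, here is a minimal sketch that plans uploads without touching OSS. The function name, the `already_uploaded` set, and the `<md5><suffix>` key layout are all hypothetical; upload_oss.py may organise keys differently.

```python
import hashlib
from pathlib import Path


def plan_uploads(image_dir, already_uploaded):
    """Map each image filename to an md5-based key and list keys still to upload.

    already_uploaded: set of object keys known to exist remotely (hypothetical input).
    """
    mapping, to_upload = {}, []
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        key = f"{digest}{path.suffix.lower()}"  # ASSUMPTION: illustrative key layout
        mapping[path.name] = key
        # Identical content yields the same key, so it is queued at most once.
        if key not in already_uploaded and key not in to_upload:
            to_upload.append(key)
    return mapping, to_upload
```

Two files with identical bytes map to the same key, and a second run with the first run's keys in `already_uploaded` queues nothing.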
3a — OCR baseline (script, one pass). Run OCR over all images to get a text baseline and a code-likelihood score per image:
source ~/.zprofile && python3 __SKILL_DIR__/scripts/ocr_image.py \
/tmp/doc_notes_<name>/images/ --json
Outputs ocr_text.json → {file: {text, code_score, code_like}}. code_like:true (high
symbol density) flags likely code / UI screenshots vs diagrams.
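One plausible way to compute a symbol-density score is sketched below. The symbol set and the 0.15 threshold are assumptions chosen for illustration; ocr_image.py may use a different formula.

```python
CODE_SYMBOLS = set("{}()[];=<>:.,_\"'#/\\|&+-*%")  # ASSUMPTION: illustrative symbol set


def code_score(text):
    """Fraction of non-whitespace characters that are code-like symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in CODE_SYMBOLS for c in chars) / len(chars)


def code_like(text, threshold=0.15):
    """True when symbol density meets the (hypothetical) threshold."""
    return code_score(text) >= threshold
```

Plain prose scores near zero, while a line such as `env.fromElements(1, 2).map(x -> x * 2).print();` scores well above the threshold.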
3b — Vision + correction. Images are NOT blindly embedded. For each image, Read it
(vision) AND consult its ocr_text.json entry, then decide:
| Image content | Action |
|---|---|
| Architecture / topology diagram | Redraw as Mermaid (don't embed) |
| Flow / sequence diagram | Redraw as Mermaid |
| Code screenshot | Transcribe into a code block — use OCR text as the baseline, then fix indentation/symbols against the image (OCR alone mangles {} ; → and indent) |
| Data / table screenshot | Rebuild as a Markdown table |
| UI / config screenshot (operation demo) | Embed OSS URL + [!INFO] caption ≥3 sentences |
| Decorative / logo | Skip |
Why both: Apple Vision OCR is fast and gives a text baseline, but errs on dense code; your vision Read corrects it. Two signals beat one, and you avoid re-typing long code.
⚠️ Context control: Read at most 4 images, write a compact summary table (filename · type · key info · decision), then read the next batch. This clears image base64 from context and prevents request-body overflow.
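The batching rule above amounts to slicing the image list into groups of four. A minimal sketch, with hypothetical filenames:

```python
def batches(items, size=4):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


images = [f"img_{i:02d}.png" for i in range(1, 11)]  # hypothetical filenames
for batch in batches(images):
    # Read this batch, write the summary table, then move on.
    print(len(batch), batch[0])
```

Ten images give batches of 4, 4, and 2.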
The latest stable version is the PRIMARY teaching baseline, not the doc's old version. The notes explain every concept, term, API, config, and recommended practice as it works in the current stable release. The source doc's version survives only as migration hints for a reader who might still meet old code. This is a rewrite to the new version, not a set of warnings bolted onto the old narrative.
- WebSearch the latest stable version, then WebFetch the official current docs (the actual feature pages, not just release notes) for the features each chapter covers. You are rewriting to this version, so you must read how today's docs actually present these features. Never rely on memory, and never invent APIs, behavior, or defaults.
- Differences from the source doc's version become [!WARNING]-flagged migration notes. Genuinely new features the doc never covered may be added as brief [!INFO] notes where relevant.

Skip this entire re-baseline only when invoked with --no-version-update, which keeps the notes faithful to the doc's original version.
python3 __SKILL_DIR__/scripts/suggest_diagrams.py \
/tmp/doc_notes_<name>/chapter_NN.json
Prints per-chapter suggestions: which sections describe architecture / flow / state / comparison / hierarchy, and the matching diagram type + scaffold. Use as hints — you draw one diagram per concept (see REFERENCE.md for rules & templates).
Loop over chapter_01.json … chapter_NN.json. For each:
6a — Skeleton (single Write, fast): frontmatter + version callout + ### headings
with <!-- FILL --> placeholders. Use the chapter's parent field for the breadcrumb.
---
title: "[chapter heading]"
source_doc: "[original filename]"
source_version: "Flink 1.15"
current_version: "Flink 1.20"
tags: [flink, big-data, streaming]
date: <date +%Y-%m-%d>
---
> 所属章节:[[00-索引]] · 上级:[parent]
> [!INFO] 版本说明
> 本笔记已按**当前最新稳定版 A.B** 重新梳理讲解(概念、API、配置、术语、推荐做法均为新版);原文档基于 X.Y。旧版差异以 [!WARNING] 迁移提示标注。
### [subsection]
<!-- FILL -->
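A skeleton like the one above could be generated along these lines. This is a simplified sketch: it emits only part of the frontmatter and omits the breadcrumb and version callout, and the function name and parameters are hypothetical.

```python
def build_skeleton(title, source_doc, headings, date):
    """Return a chapter skeleton: frontmatter plus one placeholder per heading."""
    lines = [
        "---",
        f'title: "{title}"',
        f'source_doc: "{source_doc}"',
        f"date: {date}",
        "---",
        "",
    ]
    for heading in headings:
        # One placeholder per ### section, to be replaced by a later Edit.
        lines += [f"### {heading}", "<!-- FILL -->", ""]
    return "\n".join(lines)
```

Because the skeleton contains no body text, it stays small enough to write in a single call even for a chapter with many sections.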
6b — Fill (one Edit per ### section): replace each <!-- FILL --> with structured
content written in the latest version's voice (per Step 4's re-baseline map). The old
version is never the main subject — the current stable release is.
- Where the old version differed, add a collapsible [!WARNING]- 旧版(X.Y)写法 note saying what it was and why it changed. Keep it secondary to the main narrative.
- Embedded images use the OSS URLs from url_mapping.json. One section per Edit, in order.
- In Mermaid, never use \n; always use <br/> for line breaks inside labels (A["Line1<br/>Line2"], not A["Line1\nLine2"]).

Verify every re-baselined API/behavior against Step 4's fetched official docs before writing it. When unsure whether the new version changed something, fetch and confirm; never guess a "modern" API into existence.
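The Mermaid label rule can be enforced mechanically. The sketch below rewrites a literal backslash-n inside double-quoted labels; the helper name is hypothetical, and it assumes labels are double-quoted and sit on one line.

```python
import re

# Matches a double-quoted Mermaid label such as "Line1\nLine2" on a single line.
LABEL = re.compile(r'"[^"\n]*"')


def fix_mermaid_labels(source):
    """Replace a literal backslash-n inside quoted labels with <br/>."""
    return LABEL.sub(lambda m: m.group(0).replace("\\n", "<br/>"), source)


print(fix_mermaid_labels(r'A["Line1\nLine2"] --> B["OK"]'))
# A["Line1<br/>Line2"] --> B["OK"]
```

Text outside quoted labels is left untouched, so real line breaks between Mermaid statements survive.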
6c — Chapter-end exercise WITH answers (mandatory). End each chapter with a
[!QUESTION] callout (2–4 thinking questions) immediately followed by a collapsible
[!SUCCESS]- 参考答案 callout that answers every single question. A question without a
reference answer is an incomplete chapter — never ship one.
The answers are the highest-value part of the note: they must be complete, correct, and
current (matching the latest version from Step 4), not one-line hand-waves. For each
question give the direct conclusion first, then the why (mechanism / principle), and
where useful a code snippet, a comparison, or a file:line-style pointer to the relevant
section above. Verify any version-specific API in the answer against Step 4's findings —
do not invent APIs. Match the questions to the chapter's actual content (don't ask
about something the chapter never covered). See REFERENCE.md → "Chapter-End Exercise +
Answer" for the exact format and a worked example.
Large chapter (the script prints a heads-up, e.g. 256 sections): this is normal and safe. The file is big but each Edit is small. Build the skeleton from its ### (H3) headings, then fill one ### at a time. If a single ### is itself huge (many H4 items, e.g. 50+ operators), fill it in several Edits (a few H4 items each) rather than one. Never write the whole chapter body in one call.
00-索引.md: a MOC linking every chapter [[NN-title]] in order, with a 1-line summary each, and a top-level Mermaid overview of the whole doc's structure.

Check that every chapter with thinking questions also has reference answers:
for f in <out_dir>/[0-9]*.md; do
q=$(grep -c '\[!QUESTION\]' "$f"); a=$(grep -c '\[!SUCCESS\]' "$f")
[ "$q" -gt 0 ] && [ "$a" -eq 0 ] && echo "⚠️ $f 有思考题但缺参考答案"
done
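The same check can be written in Python when a list of offending files is more convenient than echoed warnings. This is an equivalent sketch, not one of the skill's scripts; the function name is hypothetical.

```python
from pathlib import Path


def chapters_missing_answers(out_dir):
    """Return chapter files that contain [!QUESTION] but no [!SUCCESS] callout."""
    missing = []
    for path in sorted(Path(out_dir).glob("[0-9]*.md")):
        text = path.read_text(encoding="utf-8")
        if "[!QUESTION]" in text and "[!SUCCESS]" not in text:
            missing.append(path.name)
    return missing
```

An empty result means no chapter has questions without an answers callout; it does not verify that every individual question is answered.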
| Arg | Default | Description |
|---|---|---|
| doc_path | n/a | Absolute path to .docx / .doc / .pdf |
| --output-dir | /tmp/doc_notes_<name> | Override extraction output dir |
| --max-img-px | 2000 | Resize images larger than this (any side) |
| --split-level | auto | Chapter split level: auto / 2 / 3 |
| --min-sections | 15 | Merge small same-parent chapters below this size (0 disables) |
Default: one Markdown file per top-level chapter (H2). A document's 1. 2. 3.
headings are the natural units, so a typical doc becomes just a few files (e.g. Flink
基础 → 3 files), not a swarm. Falls back to H1/H3 only when H2 is absent.
A large chapter is NOT auto-split. The 120s timeout limits a single Edit, not a file. A big chapter (e.g. 256 sections) is written safely via skeleton → per-### Edit fill. The script prints a heads-up for oversized chapters; only if one is unwieldy, re-run that doc with --split-level 3 to break it up.
- --split-level 2 (default auto) / 3: force coarser or finer.
- --min-sections 15: if you ever opt into H3 splitting, small same-parent chapters merge (carrying a headings list like 1.1 1.2 1.3) so you don't get tiny files.

A "section" = one heading / paragraph / code block / list item / table / image (not words). Content is conserved exactly across all chapter files (nothing dropped or duplicated).
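The merge behaviour behind --min-sections might look roughly like the sketch below. The chapter dict shape (`parent`, `heading`, `sections`) and the greedy merge-into-previous strategy are assumptions for illustration; the real script may merge differently.

```python
def merge_small(chapters, min_sections=15):
    """Merge adjacent same-parent chapters until each group reaches min_sections.

    chapters: list of dicts with "parent", "heading", "sections" (hypothetical schema).
    """
    merged = []
    for ch in chapters:
        last = merged[-1] if merged else None
        can_merge = (
            min_sections > 0  # 0 disables merging
            and last is not None
            and last["parent"] == ch["parent"]
            and len(last["sections"]) < min_sections
        )
        if can_merge:
            last["headings"].append(ch["heading"])
            last["sections"].extend(ch["sections"])
        else:
            merged.append({
                "parent": ch["parent"],
                "headings": [ch["heading"]],
                "sections": list(ch["sections"]),
            })
    return merged
```

Sections are only moved between groups, never copied or dropped, so the concatenation of all groups equals the original sequence.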
for f in "/path/to/资料"/*.docx; do
[[ "$f" == *"(1).docx" ]] && continue # skip duplicate copies
python3 .../extract_docx.py "$f"
done
| Situation | Handling |
|---|---|
| .doc (legacy binary) | Auto-converted to .docx via textutil (macOS); else asks user to convert |
| Code in single-cell tables (多易 style) | Detected → code block with language from first line |
| Code as monospace paragraphs (other vendors) | Detected by font name → code block |
| No font-size headings (structure via numbering) | Falls back to 1./1.1 numbering depth |
| No H2 chapters at all | Splits at H1, then H3; if none, single file (warns) |
| Huge chapter (>90 sections) | auto split drops to H3 for manageable files |
| Image > 2000px | Auto-resized before upload & visual analysis |
| Duplicate image in doc | Deduped (md5) — uploaded once |
| PDF prose fragmented | Spans merged per block, not per span |
| Image embedded in table cell | Extracted as image, not lost |
| Code only exists as a screenshot | OCR baseline (ocr_image.py) + vision correction → code block |
| Scanned PDF (no text layer) | extract_docx.py yields few sections; run ocr_image.py on page images |
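For the numbering-depth fallback in the table above, a crude sketch of inferring heading depth from a leading outline number follows. The regex and function name are illustrative assumptions, and a line starting with a decimal such as "3.14" would be misread as depth 2.

```python
import re

# Matches "1.1" / "2.3.4" style numbers, or a single number followed by a dot ("1.").
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)+|\d+(?=\.))")


def numbering_depth(line):
    """Return heading depth from a leading outline number, or 0 if there is none."""
    match = NUMBERING.match(line)
    if not match:
        return 0
    return match.group(1).count(".") + 1
```

With this rule, "1.快速认识flink" maps to depth 1 and "1.1 Overview" to depth 2, while a line starting with a bare year such as "2024 was busy" is not treated as a heading.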
See REFERENCE.md for Mermaid rules, the diagram decision table, language mapping, callout formats, and the per-chapter quality checklist.
npx claudepluginhub skytechfyh/ai-marketplace --plugin doc-to-notes