From doc-harvest
Scrape a documentation website and organize it into the Claude Context Library
How this command is triggered — by the user, by Claude, or both
Slash command
/doc-harvest:harvest-docs
This command is limited to the following tools:
The summary Claude sees in its command listing — used to decide when to auto-load this command
# Harvest Documentation Command Scrape a documentation website, clean the content, and organize it into the Claude Context Library for progressive-disclosure navigation. ## Instructions ### 1. Get the Target URL If the user didn't provide a URL, ask: - What documentation site to harvest - Any specific sections to include or exclude ### 2. Parse the URL and Suggest a Slug Extract the domain and path prefix from the URL. Suggest a slug name: - `developer.apple.com/design/human-interface-guidelines/` → `apple-hig` - `docs.tailscale.com/` → `tailscale-docs` - `react.dev/reference/` → `rea...
Scrape a documentation website, clean the content, and organize it into the Claude Context Library for progressive-disclosure navigation.
If the user didn't provide a URL, ask:
Extract the domain and path prefix from the URL. Suggest a slug name:
- `developer.apple.com/design/human-interface-guidelines/` → `apple-hig`
- `docs.tailscale.com/` → `tailscale-docs`
- `react.dev/reference/` → `react-reference`

Run the discovery script:
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/harvest.py" discover "<url>"
This will:
Show the user the discovered pages as a tree structure. Include:
Use AskUserQuestion to let the user:
contexts/technical/{slug}/)
If 50+ pages: Use parallel doc-harvester agents (see Section Analysis below).
If < 50 pages: Process interactively:
For each approved page:
a. Fetch content — try WebFetch first:
WebFetch <url>
b. If WebFetch fails or returns poor content, fall back to curl + pandoc:
curl -sL "<url>" | pandoc -f html -t markdown --wrap=none
c. Clean the raw markdown:
echo '<raw_markdown>' | python3 "${CLAUDE_PLUGIN_ROOT}/scripts/harvest.py" clean --base-url "<domain>"
d. Report progress: [n/total] Fetched: <title>
e. Wait 2 seconds between requests.
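The per-page loop above (fetch, fall back, report progress, pause) can be sketched in Python. This is a minimal illustration, not part of `harvest.py`. The names `fetch_with_fallback` and `harvest_pages` are hypothetical, and so is the `min_chars` threshold used here to stand in for "poor content". The `primary` and `fallback` callables represent WebFetch and the curl + pandoc pipeline respectively.

```python
import time
from typing import Callable, Dict, List, Tuple

Fetcher = Callable[[str], str]


def fetch_with_fallback(url: str, primary: Fetcher, fallback: Fetcher,
                        min_chars: int = 200) -> str:
    """Try the primary fetcher; use the fallback on error or thin content.

    min_chars is an assumed heuristic for "poor content", not a documented rule.
    """
    try:
        content = primary(url)
    except Exception:
        content = ""
    if len(content.strip()) < min_chars:
        content = fallback(url)
    return content


def harvest_pages(pages: List[Tuple[str, str]], primary: Fetcher, fallback: Fetcher,
                  delay_seconds: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep,
                  log: Callable[[str], None] = print) -> Dict[str, str]:
    """Fetch (title, url) pairs in order, pausing between requests."""
    results: Dict[str, str] = {}
    total = len(pages)
    for n, (title, url) in enumerate(pages, start=1):
        if n > 1:
            sleep(delay_seconds)  # pause between requests, not before the first
        results[url] = fetch_with_fallback(url, primary, fallback)
        log(f"[{n}/{total}] Fetched: {title}")
    return results
```

Passing `sleep` and `log` as parameters keeps the loop testable without real delays or network access.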
When the approved page count is ≥ 50:
Group pages by section — extract the first URL path segment after the base prefix:
- apple-hig/foundations/color → section foundations
- apple-hig/getting-started → section getting-started (treat as its own group)
root
Parallelization decision:
doc-harvester agent (same as before)
Binning rules (max 6 agents):
misc group
misc group has pages, it becomes one agent
Write the root _index.md first (before launching any agents):
Show the proposed split to the user before launching:
Launching 4 parallel harvester agents:
Agent 1: foundations — 18 pages
Agent 2: components — 63 pages
Agent 3: patterns — 25 pages
Agent 4: misc (inputs, getting-started, technologies) — 49 pages
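The grouping step that produces a split like the one above (first URL path segment after the base prefix) could look roughly like this in Python. The function names and the `example.com` host are hypothetical. Mapping a page that sits exactly at the base prefix to `root` is an assumption. The binning rules are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def section_of(url: str, base_url: str) -> str:
    """Return the first path segment after the base prefix.

    Pages located exactly at the base prefix are labelled "root" (assumption).
    """
    base_path = urlparse(base_url).path.strip("/")
    path = urlparse(url).path.strip("/")
    if base_path and (path == base_path or path.startswith(base_path + "/")):
        path = path[len(base_path):].strip("/")
    first_segment = path.split("/", 1)[0]
    return first_segment or "root"


def group_by_section(urls: Iterable[str], base_url: str) -> Dict[str, List[str]]:
    """Group page URLs by section, preserving the input order within each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        groups[section_of(url, base_url)].append(url)
    return dict(groups)
```

The resulting per-section page counts are what the proposed split reports to the user before any agents launch.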
Launch all agents in a single message (multiple parallel Agent tool calls). Each agent receives:
- section_name: the section label (for progress logs like [3/18] foundations/color)
- pages: the filtered manifest (only pages for this section)
- slug, display_name, ccl_path: same as the full harvest
- skip_index_update: true (the agent must NOT run update-index or update known-docs.md)

After all agents complete, run update-index and update known-docs.md once (steps 8–9 below).
Generate the file tree plan:
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/harvest.py" organize --slug "<slug>" --manifest "<manifest_path>"
Review the plan, then write files with --write:
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/harvest.py" organize --slug "<slug>" --manifest "<manifest_path>" --write --ccl-root "<ccl_path>"
The CCL root path is:
~/Library/Mobile Documents/iCloud~md~obsidian/Documents/Claude Context Library/
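Because this path starts with `~` and contains spaces, it helps to resolve it once before passing it around. A minimal Python sketch, assuming the doc set lives under `contexts/technical/<slug>/` as in the registry row format; the helper name `doc_set_dir` is hypothetical:

```python
from pathlib import Path

CCL_ROOT = "~/Library/Mobile Documents/iCloud~md~obsidian/Documents/Claude Context Library/"


def doc_set_dir(slug: str, ccl_root: str = CCL_ROOT) -> Path:
    """Resolve the directory for one harvested doc set.

    expanduser() only expands the leading "~"; the tildes inside
    "iCloud~md~obsidian" are left untouched.
    """
    return Path(ccl_root).expanduser() / "contexts" / "technical" / slug
```

Working with a `Path` object also avoids quoting problems that the embedded spaces would cause in string concatenation.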
For each page, write the cleaned content to its CCL location. Each file needs frontmatter:
---
name: <Page Title> - <Doc Set Name>
source_url: <original_url>
parent: <section>/_index.md
last_updated: <today's date>
---
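For illustration, the per-page frontmatter above could be rendered with a small Python helper. This is a sketch, not the code in `harvest.py`. The function name is hypothetical, and the ISO date format is an assumption, since the original does not specify how "today's date" is written.

```python
from datetime import date
from typing import Optional


def page_frontmatter(title: str, doc_set_name: str, source_url: str,
                     section: str, today: Optional[date] = None) -> str:
    """Render the frontmatter block for one harvested page."""
    if today is None:
        today = date.today()
    lines = [
        "---",
        f"name: {title} - {doc_set_name}",
        f"source_url: {source_url}",
        f"parent: {section}/_index.md",
        f"last_updated: {today.isoformat()}",  # ISO date format is an assumption
        "---",
    ]
    return "\n".join(lines) + "\n"
```

Titles containing characters that are special in YAML, such as `: `, would need quoting, which this sketch does not handle.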
The root _index.md gets additional metadata:
---
name: <Doc Set Display Name>
domain: technical
source_url: <base_url>
harvested: <today's date>
page_count: <count>
triggers:
- "<keyword1>"
- "<keyword2>"
related_contexts: []
last_updated: <today's date>
---
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/harvest.py" update-index \
--ccl-root "$HOME/Library/Mobile Documents/iCloud~md~obsidian/Documents/Claude Context Library/" \
--slug "<slug>" \
--name "<display_name>"
Add an entry to the doc-navigator registry:
Edit ${CLAUDE_PLUGIN_ROOT}/skills/doc-navigator/references/known-docs.md
Add a row to the table:
| <Name> | <slug> | contexts/technical/<slug>/ | <source_domain> | <page_count> | <today's date> |
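As a rough sketch, the registry row could be built by a helper like the one below. The function name is hypothetical, and escaping `|` inside cell values is an added precaution that the original does not mention.

```python
def known_docs_row(name: str, slug: str, source_domain: str,
                   page_count: int, harvested: str) -> str:
    """Format one known-docs.md table row in the column order shown above."""
    def cell(value: object) -> str:
        # Escape pipes so a value cannot break the table layout (assumed precaution).
        return str(value).replace("|", "\\|")

    cells = [
        cell(name),
        cell(slug),
        f"contexts/technical/{cell(slug)}/",
        cell(source_domain),
        cell(page_count),
        cell(harvested),
    ]
    return "| " + " | ".join(cells) + " |"
```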
Show the user:
/harvest-docs <url> again to update."
npx claudepluginhub mattwag05/mw-plugins --plugin doc-harvest