rawdoc
Fetch web pages as clean markdown for AI coding agents.

Single Go binary. One dependency (x/net/html). Fetches HTML, strips noise, outputs markdown. Works as a CLI, MCP server, and Claude Code plugin.
Install
Claude Code Plugin (recommended)
/install-plugin RandomCodeSpace/rawdoc
Adds /rawdoc and /rawdoc-crawl slash commands plus rawdoc_fetch and rawdoc_crawl MCP tools. The setup hook builds the binary automatically, which requires Go 1.25+.
CLI
go install github.com/RandomCodeSpace/rawdoc@latest
MCP Server
rawdoc --serve
Runs as a JSON-RPC stdio server implementing the Model Context Protocol. Exposes rawdoc_fetch and rawdoc_crawl tools. See Manual MCP Setup below for configuration.
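As a rough illustration of what an MCP client writes to the server's stdin, the sketch below builds a JSON-RPC 2.0 tools/call request for rawdoc_fetch. The tools/call method and the name/arguments params come from the Model Context Protocol. The url argument name is an assumption, since the tool's input schema is not shown here, and a real client performs the initialize handshake before calling any tool.

```python
import json

def build_tool_call(request_id, tool, arguments):
    """Return one newline-terminated JSON-RPC 2.0 message for an MCP tools/call."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport carries one JSON object per line.
    return json.dumps(message) + "\n"

# "url" is an assumed argument name; the real schema is reported by tools/list.
line = build_tool_call(1, "rawdoc_fetch", {"url": "https://pkg.go.dev/fmt"})
print(line, end="")
```

In practice an MCP-aware client such as Claude Code produces these messages for you; the sketch only shows the wire shape.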
What It Does
- Fetches HTML via plain HTTP with browser-like headers
- Strips noise — scripts, styles, navbars, footers, ads, cookie banners, hidden elements
- Extracts main content using site-specific selectors or readability scoring
- Converts to clean markdown (headings, code blocks, tables, lists)
- Crawls linked pages when given a depth > 0
Token reduction versus raw HTML varies by page and can exceed 95%; the pkg.go.dev example under Verbose Output saves 69%. Works on server-rendered sites. JS-only SPAs are not supported.
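The noise-stripping step in the list above can be pictured with a small sketch. This is not rawdoc's implementation (rawdoc is written in Go on top of x/net/html); it is a simplified illustration using Python's standard html.parser, and the tag set is an assumed subset of what counts as noise.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer"}  # illustrative subset only

class NoiseStripper(HTMLParser):
    """Collect visible text, skipping everything nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_noise(html):
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<main><h1>Pods</h1><p>A Pod is a group of containers.</p></main>"
    "<script>track()</script><footer>Copyright</footer></body></html>"
)
text = strip_noise(page)
print(text)
```

Only the heading and paragraph inside main survive; the style, nav, script, and footer contents are dropped.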
Usage
# Single page → stdout
rawdoc https://kubernetes.io/docs/concepts/workloads/pods/
# Just the code blocks
rawdoc https://www.baeldung.com/spring-kafka --code-only
# JSON output with metadata
rawdoc https://pkg.go.dev/fmt -f json
# YAML output
rawdoc https://pkg.go.dev/fmt -f yaml
# Save to file
rawdoc https://example.com -o docs.md
# Crawl docs to a directory (depth=2, max 50 pages)
rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 -o ~/docs/k8s/
# Verbose — see fetch decisions and token stats
rawdoc https://www.baeldung.com/spring-kafka -v
# MCP server mode (stdio JSON-RPC)
rawdoc --serve
Verbose Output
[tier1] https://pkg.go.dev/fmt → fetching
[stats] input: 139.2KB (35634 tokens) → output: 43.5KB (11135 tokens) | 69% saved
[output] wrote json to docs.json
All verbose output goes to stderr. stdout stays clean for piping.
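The percentage in the [stats] line follows from the two token counts. A minimal sketch of that arithmetic, assuming the figure is the share of input tokens removed, rounded to a whole number (the function name is illustrative and not part of rawdoc):

```python
def percent_saved(input_tokens, output_tokens):
    """Share of input tokens removed, as a whole-number percentage."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return round(100 * (1 - output_tokens / input_tokens))

# Token counts taken from the sample verbose output above.
saved = percent_saved(35634, 11135)
print(f"{saved}% saved")
```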
Flags
Output
| Flag | Default | Description |
|---|---|---|
| -o, --output | stdout | File or directory |
| -f, --format | markdown | One of markdown, text, json, yaml |
| --code-only | — | Extract only code blocks |
| --no-links | — | Strip link URLs, keep text only |
Crawling
| Flag | Default | Description |
|---|---|---|
| -d, --depth | 0 | Crawl depth (0 = single page) |
| -c, --concurrency | 5 | Parallel fetches |
| --max-pages | 50 | Page limit |
| --delay | 1s | Delay between requests |
| --include | — | URL path glob to include |
| --exclude | — | URL path glob to exclude |
| --sitemap | — | Parse sitemap.xml for URL discovery |
HTTP
| Flag | Default | Description |
|---|---|---|
| --timeout | 15s | Per-request timeout |
| --max-time | 10m | Total runtime ceiling |
| --max-retries | 3 | Per-URL retries with exponential backoff |
| --header K=V | — | Extra header (repeatable) |
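The --max-retries flag describes retries with exponential backoff. The sketch below shows one common schedule, in which the wait doubles after each failed attempt. The 1-second base and the doubling factor are assumptions chosen for illustration; rawdoc's actual intervals are not documented here.

```python
def backoff_delays(max_retries, base_seconds=1.0, factor=2.0):
    """Return the wait before each retry: base, base*factor, base*factor**2, ..."""
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]

# With the default of 3 retries and the assumed 1s base.
delays = backoff_delays(3)
print(delays)
```

Whatever the real intervals are, the total time spent retrying is still bounded by --timeout per request and --max-time overall.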
Info
| Flag | Default | Description |
|---|---|---|
| -v, --verbose | — | Fetch log and token stats to stderr |
| -q, --quiet | — | Suppress all stderr |
| --serve | — | Run as MCP stdio server |
| --version | — | Print version |
Crawl Mode
rawdoc https://kubernetes.io/docs/concepts/workloads/ -d 2 --max-pages 50 -o ~/docs/k8s/
Writes one .md file per page plus an index.md:
~/docs/k8s/
├── index.md
├── workloads.md
├── workloads-pods.md
├── workloads-controllers-deployment.md
└── ...
Stays on the same domain. Respects --include/--exclude globs and --max-pages limit.
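The crawl rules above (same domain, include/exclude globs, page limit) can be sketched as a link filter. This is an illustration under stated assumptions, not rawdoc's code: should_crawl is a hypothetical helper, and Python's fnmatch lets * match across "/" separators, which may differ from the glob semantics rawdoc applies.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def should_crawl(url, start_url, include=None, exclude=None):
    """Decide whether a discovered link is eligible for crawling."""
    target, start = urlparse(url), urlparse(start_url)
    if target.netloc != start.netloc:
        return False  # stay on the same domain
    if exclude and fnmatchcase(target.path, exclude):
        return False  # explicitly excluded path
    if include and not fnmatchcase(target.path, include):
        return False  # an include glob was given and the path misses it
    return True

start = "https://kubernetes.io/docs/concepts/workloads/"
links = [
    "https://kubernetes.io/docs/concepts/workloads/pods/",
    "https://kubernetes.io/blog/",
    "https://github.com/kubernetes/kubernetes",
]
max_pages = 50  # mirrors the --max-pages default
queue = [u for u in links if should_crawl(u, start, include="/docs/*")][:max_pages]
print(queue)
```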
Output Formats
| Format | Description |
|---|---|
| markdown | Headings, code blocks, tables, lists (default) |
| text | Plain text, no markup |
| json | Structured: url, title, content, code_blocks, fetch_tier, token count |
| yaml | Same fields as JSON |
| --code-only | Only fenced code blocks from the page |
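A script that consumes the json format might look like the sketch below. The sample document is made up for illustration: the field names follow the table above, but the tokens key, the fetch_tier value, and the shape of code_blocks (a list of strings) are assumptions, so check real output before relying on them.

```python
import json

# Hypothetical stand-in for the output of: rawdoc https://pkg.go.dev/fmt -f json
sample = json.dumps({
    "url": "https://pkg.go.dev/fmt",
    "title": "fmt",
    "content": "# fmt\n\nPackage fmt implements formatted I/O.",
    "code_blocks": ['fmt.Println("hello")'],  # assumed: list of strings
    "fetch_tier": "tier1",                    # assumed value format
    "tokens": 12,                             # assumed key name
})

def summarize(raw):
    """Parse one rawdoc JSON document and return a short summary."""
    doc = json.loads(raw)
    return {
        "title": doc.get("title", ""),
        "code_block_count": len(doc.get("code_blocks", [])),
        "tokens": doc.get("tokens"),
    }

summary = summarize(sample)
print(summary)
```

Because verbose output goes to stderr, the same parsing works when the JSON is piped from stdout.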
Site-Specific Selectors
Built-in content selectors for: Baeldung, Docusaurus, GitBook, ReadTheDocs, MkDocs, Spring.io, GitHub, MDN, pkg.go.dev, StackOverflow, Medium, Dev.to, Confluence, Notion.
Falls back to readability scoring when no selector matches.
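The selector-then-fallback behaviour can be sketched as a lookup keyed on hostname. The selectors below are placeholders invented for this example, not rawdoc's real rules, and pick_strategy is a hypothetical name; the point is only the dispatch order.

```python
from urllib.parse import urlparse

# Placeholder selectors for illustration only; not rawdoc's built-in values.
SITE_SELECTORS = {
    "developer.mozilla.org": "main article",
    "pkg.go.dev": "main",
}

def pick_strategy(url):
    """Return ("selector", css) for a known host, else ("readability", None)."""
    host = urlparse(url).hostname or ""
    selector = SITE_SELECTORS.get(host)
    if selector is not None:
        return ("selector", selector)
    return ("readability", None)

print(pick_strategy("https://pkg.go.dev/fmt"))
print(pick_strategy("https://example.com/post"))
```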
Claude Code Plugin
What You Get