From web-crawler
High-performance Rust web crawler with stealth mode, LLM-ready Markdown export, multi-format output, sitemap discovery, and robots.txt support. Optimized for content extraction, site mapping, structure analysis, and LLM/RAG pipelines.
How this skill is triggered — by the user, by Claude, or both
Slash command
/web-crawler:website-crawlerThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
High-performance web crawler built in **pure Rust** with
Cargo.tomlREADME.mdconstitution.mdsrc/config/mod.rssrc/config/profiles.rssrc/crawler/checkpoint.rssrc/crawler/dedup.rssrc/crawler/engine.rssrc/crawler/mod.rssrc/crawler/rate_limiter.rssrc/crawler/robots.rssrc/crawler/worker.rssrc/integrations/mod.rssrc/integrations/raycast.rssrc/lib.rssrc/main.rssrc/output/html.rssrc/output/json.rssrc/output/mod.rssrc/output/streaming.rsHigh-performance web crawler built in pure Rust with production-grade features for fast, reliable site crawling.
Use this skill when the user requests:
[Progress] Pages: X/Y | Active jobs: Z | Errors: N~/.claude/skills/web-crawler/bin/rcrawler
# Clone the repository
git clone https://github.com/leobrival/rcrawler.git
cd rcrawler
# Build release binary
cargo build --release
# Copy to skill directory
cp target/release/rcrawler ~/.claude/skills/web-crawler/bin/
Build time: ~2 minutes Binary size: 5.4 MB
~/.claude/skills/web-crawler/bin/rcrawler <URL> [OPTIONS]
Core Options:
-w, --workers <N>: Number of concurrent workers (default: 20, range: 1-50)-d, --depth <N>: Maximum crawl depth (default: 2)-r, --rate <N>: Rate limit in requests/second (default: 2.0)Configuration:
-p, --profile <NAME>: Use predefined profile (fast/deep/gentle)--domain <DOMAIN>: Restrict to specific domain (auto-detected from URL)-o, --output <PATH>: Custom output directory (default: ./output)Features:
-s, --sitemap: Enable/disable sitemap discovery (default: true)--stealth: Enable stealth mode with user-agent rotation--markdown: Convert HTML to LLM-ready Markdown with frontmatter--filter-content: Enable content filtering (remove nav, ads, scripts)--debug: Enable debug logging with detailed trace information--resume: Resume from checkpoint if availableOutput:
-f, --formats <LIST>: Output formats (json,markdown,html,csv,links,text)~/.claude/skills/web-crawler/bin/rcrawler <URL> -p fast
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p deep
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p gentle
~/.claude/skills/web-crawler/bin/rcrawler https://example.com
Output:
[2026-01-10T01:17:27Z] INFO Starting crawl of: https://example.com
[2026-01-10T01:17:27Z] INFO Config: 20 workers, depth 2
Fetching sitemap URLs...
[Progress] Pages: 50/120 | Active jobs: 15 | Errors: 0
[Progress] Pages: 100/180 | Active jobs: 8 | Errors: 0
Crawl complete!
Pages crawled: 150
Duration: 8542ms
Results saved to: ./output/results.json
HTML report: ./output/index.html
~/.claude/skills/web-crawler/bin/rcrawler https://docs.example.com \
--stealth --markdown -f markdown -d 3
Use case: Content extraction for LLM/RAG pipelines Expected: Clean Markdown with frontmatter, anti-detection headers
~/.claude/skills/web-crawler/bin/rcrawler https://blog.example.com -p fast
Use case: Quick blog mapping Expected: 50 workers, depth 3, ~3-5 seconds for 100 pages
~/.claude/skills/web-crawler/bin/rcrawler https://example.com \
-f json,markdown,csv,links -o ./export
Use case: Export data in multiple formats simultaneously Expected: Generates results.json, results.md, results.csv, results.txt
~/.claude/skills/web-crawler/bin/rcrawler https://example.com --debug
Output: Detailed trace logs for troubleshooting
./output/
├── results.json # Structured crawl data
├── results.md # LLM-ready Markdown (with --markdown)
├── results.html # Interactive report
├── results.csv # Spreadsheet format (with -f csv)
├── results.txt # URL list (with -f links)
└── checkpoint.json # Auto-saved state (every 30s)
{
"stats": {
"pages_found": 450,
"pages_crawled": 450,
"external_links": 23,
"excluded_links": 89,
"errors": 0,
"start_time": "2026-01-10T01:00:00Z",
"end_time": "2026-01-10T01:00:07Z",
"duration": 7512
},
"results": [
{
"url": "https://example.com",
"title": "Example Domain",
"status_code": 200,
"depth": 0,
"links": ["https://example.com/page1", "..."],
"crawled_at": "2026-01-10T01:00:01Z",
"content_type": "text/html"
}
]
}
When a user requests a crawl, follow these steps:
Extract from user message:
~/.claude/skills/web-crawler/bin/rcrawler <URL> \
-w <workers> \
-d <depth> \
-r <rate> \
[--debug] \
[-o <output>]
Use Bash tool to run the command:
~/.claude/skills/web-crawler/bin/rcrawler https://example.com -w 20 -d 2
Watch for progress updates in output:
[Progress] Pages: X/Y | Active jobs: Z | Errors: NWhen crawl completes, inform user:
./output/results.json./output/index.htmlopen ./output/index.htmlRequest: "Crawl docs.example.com"
Parse: URL = https://docs.example.com, use defaults
Command: rcrawler https://docs.example.com
Request: "Quick scan of blog.example.com"
Parse: URL = blog.example.com, profile = fast
Command: rcrawler https://blog.example.com -p fast
Request: "Deep crawl of api-docs.example.com with 40 workers"
Parse: URL = api-docs.example.com, workers = 40, depth = deep
Command: rcrawler https://api-docs.example.com -w 40 -d 5
Request: "Crawl example.com carefully, don't overload their server"
Parse: URL = example.com, profile = gentle
Command: rcrawler https://example.com -p gentle
Request: "Map the structure of help.example.com"
Parse: URL = help.example.com, depth = moderate
Command: rcrawler https://help.example.com -d 3
# Check if binary exists
ls ~/.claude/skills/web-crawler/bin/rcrawler
# If missing, build it
cd ~/.claude/skills/web-crawler/scripts && cargo build --release
Network errors:
curl -I <URL>-r 1robots.txt blocking:
curl <URL>/robots.txtTimeout errors:
-w 10-r 1Too many errors:
--debugCrawlEngine (src/crawler/engine.rs)
RobotsChecker (src/crawler/robots.rs)
RateLimiter (src/crawler/rate_limiter.rs)
UrlFilter (src/utils/filters.rs)
HtmlParser (src/parser/html.rs)
SitemapParser (src/parser/sitemap.rs)
First crawl should use defaults to understand site structure.
Watch the [Progress] lines to ensure crawl is progressing.
Interactive visualization helps understand site structure better than JSON.
Default 2 req/s is safe for most sites. Increase cautiously.
--debug flag provides detailed logs for troubleshooting.
Check <URL>/robots.txt to understand crawling restrictions.
Avoid overwriting results with -o flag.
~/.claude/skills/web-crawler/bin/rcrawlerVersion: 1.0.0 Status: Production Ready
npx claudepluginhub leobrival/rcrawlerScrapes, maps, and crawls websites via Firecrawl MCP, returning raw HTML, metadata (og, twitter, JSON-LD), and screenshots. Invoked when WebFetch cannot access JS-rendered DOM or raw head data.
Crawls websites and extracts content from multiple pages via the Tavily CLI. Supports depth/breadth control, path filtering, semantic instructions, and saving pages as local markdown files.
Scrapes URLs to markdown/HTML/JSON, crawls websites for multi-page extraction, searches the web, maps sites, and extracts structured data using Firecrawl MCP tools.