Fetches any URL and returns clean Markdown via local trafilatura, with Exa MCP fallback for JS-rendered or anti-bot pages. Use instead of built-in WebFetch for reading, scraping, or summarizing web pages.
Slash command
/ai-driven-development:fetch-url-as-markdown
Fetch any web URL and get clean, readable Markdown — main content only, no navigation/footer/ads. Local + free by default; smart fallback to Exa MCP when the page can't be extracted locally.
Try trafilatura first:
python3 ~/.claude/skills/fetch-url-as-markdown/scripts/fetch_url.py "<URL>"
If exit code is 1 or 2 → fall back to Exa MCP with the same URL:
mcp__exa__web_search_advanced_exa(
query="<URL>",
includeDomains=["<host of URL>"],
numResults=1,
textMaxCharacters=50000,
type="auto"
)
(mcp__exa__crawling works too if the server exposes it; the web_search_advanced_exa
call above is the always-available variant — pin the host with includeDomains and
use the URL itself as the query.)
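As a rough illustration of how the fallback arguments above can be assembled from a URL, here is a minimal Python sketch. The helper name `build_exa_args` is hypothetical and not part of the skill. It only builds the argument set and does not call the MCP tool itself.

```python
from urllib.parse import urlsplit


def build_exa_args(url: str, max_chars: int = 50000) -> dict:
    """Build the arguments for the web_search_advanced_exa fallback call."""
    # .hostname is lowercased and excludes any port or credentials.
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"cannot derive a host from {url!r}")
    return {
        "query": url,                 # the URL itself is the query
        "includeDomains": [host],     # pin results to the page's own host
        "numResults": 1,
        "textMaxCharacters": max_chars,
        "type": "auto",
    }


args = build_exa_args("https://docs.example.com/guide/install?lang=en")
print(args["includeDomains"])
```

Pinning `includeDomains` to the parsed host rather than a hand-typed string avoids mismatches when the URL carries a port or mixed-case host name.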
Exit code 3 means trafilatura is not installed — install once:
python3 -m pip install --break-system-packages trafilatura
| Code | Meaning | Action |
|---|---|---|
| 0 | Markdown printed to stdout | done |
| 1 | DownloadError — network/HTTP/timeout/anti-bot block at fetch | fall back to Exa |
| 2 | ExtractionError — empty extract, JS/Cloudflare wall, or stub body (<200 chars) | fall back to Exa |
| 3 | trafilatura missing | install (see above), then retry |
| 4 | UnsupportedContentTypeError — URL is binary (PDF, image, archive) | don't fall back to Exa; use the right specialized skill (e.g. pdf for PDFs) |
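The table above can be read as a small dispatch on the script's exit status. The sketch below shows one way a caller might wrap it, assuming the script path from the command shown earlier. The action labels (`fallback_exa` and so on) and the function names are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path

# Path taken from the command shown earlier; adjust if the skill lives elsewhere.
SCRIPT = Path.home() / ".claude/skills/fetch-url-as-markdown/scripts/fetch_url.py"

# Hypothetical action labels, one per row of the exit-code table.
ACTIONS = {
    0: "done",
    1: "fallback_exa",
    2: "fallback_exa",
    3: "install_trafilatura",
    4: "use_specialized_skill",
}


def action_for_exit_code(code: int) -> str:
    """Map an exit code to the next step; anything unlisted is 'unknown'."""
    return ACTIONS.get(code, "unknown")


def fetch(url: str) -> tuple[str, str]:
    """Run the script once and return (action, stdout)."""
    proc = subprocess.run(
        ["python3", str(SCRIPT), url],
        capture_output=True,
        text=True,
    )
    return action_for_exit_code(proc.returncode), proc.stdout
```

A caller would invoke `fetch(url)` and only read the returned Markdown when the action is `done`. Exit code 4 deliberately maps to its own action, not to the Exa fallback.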
- output_format="markdown", include_formatting=True: keeps headings/lists/code structure where the source HTML uses real <h1..h6> etc.
- include_links=True, include_tables=True
- with_metadata=True → emits a YAML frontmatter (title, author, date, url, hostname)
- favor_recall=True, deduplicate=True: readable but trims duplicates
- scripts/settings.cfg
- Content-Type other than text/html|application/xhtml+xml|text/plain|application/xml|text/xml → exit 4
- Body shorter than 200 chars (--min-body N, 0 to disable) → exit 2

... fetch_url.py "<URL>" --no-links # strip hyperlinks
... fetch_url.py "<URL>" --no-tables # strip tables
... fetch_url.py "<URL>" --no-metadata # omit YAML header
... fetch_url.py "<URL>" --comments # include user comments (off by default — usually noise)
... fetch_url.py "<URL>" --images # include image refs (experimental)
... fetch_url.py "<URL>" --precision # terser output, drops borderline content
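To make the relationship between the flags and the default extraction settings concrete, here is a minimal sketch of how a script like this might translate them into keyword arguments for `trafilatura.extract`. This is an assumption about the mapping, not the actual contents of `fetch_url.py`: the function names are hypothetical, the 200-character default for `--min-body` is inferred from the exit-code table, and the treatment of `--precision` as swapping `favor_recall` for `favor_precision` is a guess.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="fetch_url.py")
    parser.add_argument("url")
    parser.add_argument("--no-links", action="store_true")
    parser.add_argument("--no-tables", action="store_true")
    parser.add_argument("--no-metadata", action="store_true")
    parser.add_argument("--comments", action="store_true")
    parser.add_argument("--images", action="store_true")
    parser.add_argument("--precision", action="store_true")
    # Assumed default of 200; 0 disables the stub-body check.
    parser.add_argument("--min-body", type=int, default=200, metavar="N")
    return parser


def extract_kwargs(ns: argparse.Namespace) -> dict:
    """Turn parsed flags into keyword arguments for trafilatura.extract."""
    return {
        "output_format": "markdown",
        "include_formatting": True,
        "include_links": not ns.no_links,
        "include_tables": not ns.no_tables,
        "with_metadata": not ns.no_metadata,
        "include_comments": ns.comments,
        "include_images": ns.images,
        # Assumption: --precision trades recall for precision.
        "favor_precision": ns.precision,
        "favor_recall": not ns.precision,
        "deduplicate": True,
    }


ns = build_parser().parse_args(["https://example.com/post", "--no-links", "--precision"])
kwargs = extract_kwargs(ns)
print(kwargs["include_links"], kwargs["favor_precision"], kwargs["favor_recall"])
# prints: False True False
```

With trafilatura installed, the resulting dictionary could be passed as `trafilatura.extract(html, **kwargs)`. The sketch stops short of that call so it runs without the dependency.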
| Situation | Tool |
|---|---|
| Article, blog post, docs, README, wiki | trafilatura (default) — local, free |
| JS-heavy SPA, login-walled, Cloudflare | Exa fallback (the script will signal exit 1 or 2) |
| Bulk / many URLs | trafilatura — no quota, no API key |
| Already failed twice on a domain | Exa directly |
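The "already failed twice on a domain" rule in the table above implies some per-host bookkeeping. A minimal sketch of that idea follows. The class name `ToolRouter` and its methods are hypothetical, and nothing in the skill is known to persist failure counts this way.

```python
from collections import Counter
from urllib.parse import urlsplit


class ToolRouter:
    """Pick a tool per URL, skipping trafilatura for hosts that keep failing."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.failures = Counter()  # host -> number of failed local fetches

    @staticmethod
    def _host(url: str) -> str:
        return urlsplit(url).hostname or ""

    def record_failure(self, url: str) -> None:
        """Call after the script exits with 1 or 2 for this URL."""
        self.failures[self._host(url)] += 1

    def choose(self, url: str) -> str:
        """Return 'exa' once a host has reached the failure limit."""
        if self.failures[self._host(url)] >= self.max_failures:
            return "exa"
        return "trafilatura"


router = ToolRouter()
router.record_failure("https://app.example.com/dashboard")
router.record_failure("https://app.example.com/settings")
print(router.choose("https://app.example.com/reports"))  # prints: exa
print(router.choose("https://blog.example.org/post"))    # prints: trafilatura
```

Counting by host rather than by full URL matches the table's wording: two failures anywhere on a domain are enough to route later requests for that domain straight to Exa.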
npx claudepluginhub codealive-ai/ai-driven-development --plugin ai-driven-development