From newsroom
Web scraping guide for sub-agents. Covers Firecrawl CLI fallback scraping when WebFetch fails (JS-heavy sites, anti-bot walls, 403 errors, empty content) and advanced capabilities like structured data extraction with Zod schemas, multi-page crawls, and search-plus-scrape. Use when WebFetch returns garbage or empty pages, when you need typed data from a page (prices, features, specs), or when you need to ingest multiple pages from a site.
Slash command
/newsroom:web-scraping
**Required tools for consuming agents**: WebFetch, Bash(bunx firecrawl-cli *), Read
Integration: Any newsroom sub-agent should consult this skill when WebFetch fails or when structured/multi-page scraping is needed.
| Need | Tool | Details |
|---|---|---|
| Page content as markdown | WebFetch first, then Firecrawl CLI | See below |
| Structured data from a page (prices, features, specs) | Firecrawl extract | Read references/structured-extraction.md |
| Multiple pages from one site | Firecrawl crawl | Read references/crawling.md |
| Search the web + scrape results | Firecrawl search | Read references/crawling.md |
WebFetch is free, fast, and already available. Use it by default.
Works for: blogs, news articles, documentation, static pages, most forum threads.
Switch to Firecrawl CLI when WebFetch returns a 403 error, an anti-bot wall, garbage, or empty content (typical of JS-heavy sites).
Do NOT retry WebFetch on the same URL -- it will fail again.
Requires: firecrawl-cli (install: npm install -g firecrawl-cli or use via bunx firecrawl-cli). Authenticates via FIRECRAWL_API_KEY env var or firecrawl auth --api-key <key>.
If firecrawl-cli is not installed or FIRECRAWL_API_KEY is unset, skip to Step 4 (Report Gaps). Do not retry or attempt workarounds.
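The availability check above could be sketched as a small preflight function. This is an illustrative sketch, not part of the skill: `firecrawl_ready` is a hypothetical helper name, and it assumes a global install exposes a binary named `firecrawl` (as the `firecrawl auth` command suggests). It only inspects the environment variable, so credentials stored via `firecrawl auth --api-key <key>` are not detected.

```shell
# Hypothetical preflight helper: returns 0 only if Firecrawl looks usable.
firecrawl_ready() {
  # The API key must be set and non-empty.
  if [ -z "${FIRECRAWL_API_KEY:-}" ]; then
    echo "FIRECRAWL_API_KEY is unset" >&2
    return 1
  fi
  # Accept either a global install (binary name assumed to be "firecrawl")
  # or bunx, which can run firecrawl-cli without installing it.
  if command -v firecrawl >/dev/null 2>&1 || command -v bunx >/dev/null 2>&1; then
    return 0
  fi
  echo "neither firecrawl nor bunx found on PATH" >&2
  return 1
}
```

A caller might run `firecrawl_ready || echo "skip to Step 4"` before attempting any scrape, so that a missing key or CLI leads straight to gap reporting instead of a failed command.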
Output to stdout (default -- pipe or capture as needed):
bunx firecrawl-cli scrape "<url>"
Output to file (more token-efficient -- read from disk instead of context):
bunx firecrawl-cli scrape "<url>" -o /tmp/scrape-output.md
Then use the Read tool on /tmp/scrape-output.md to pull only what you need into context.
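As a hedged sketch of the file-output pattern, the wrapper below runs the same scrape command and then confirms that the output file actually has content before anything is read into context. `scrape_to_file` is a hypothetical name, and the sketch assumes (unverified) that the CLI exits non-zero on failure; the non-empty check is there in case it does not.

```shell
# Hypothetical wrapper: scrape one URL to a file and verify the result.
# Usage: scrape_to_file <url> <output-path>
scrape_to_file() {
  stf_url=$1
  stf_out=$2
  # Remove any stale output so an old file cannot mask a failed scrape.
  rm -f "$stf_out"
  # Assumption: a failed scrape yields a non-zero exit status.
  if ! bunx firecrawl-cli scrape "$stf_url" -o "$stf_out"; then
    echo "scrape failed: $stf_url" >&2
    return 1
  fi
  # -s is true only if the file exists and is non-empty.
  if [ ! -s "$stf_out" ]; then
    echo "scrape produced no content: $stf_url" >&2
    return 1
  fi
  return 0
}
```

For example, `scrape_to_file "<url>" /tmp/scrape-output.md` followed by the Read tool on that path keeps an empty or missing file from being mistaken for a successful scrape.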
Handles: JS rendering, dynamic content, basic anti-bot bypass, clean Markdown output (strips nav, headers, footers with --only-main-content).
Does NOT handle: login-gated content, CAPTCHAs, form filling, aggressive Cloudflare Turnstile.
For multiple URLs, scrape each separately to different files:
bunx firecrawl-cli scrape "<url1>" -o /tmp/scrape-1.md
bunx firecrawl-cli scrape "<url2>" -o /tmp/scrape-2.md
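For more than a couple of URLs, the same pattern can be written as a loop. The sketch below is illustrative only: `scrape_each` and the `SCRAPE_DIR` variable are hypothetical names introduced here, and the numbered filenames simply follow the `/tmp/scrape-N.md` convention shown above.

```shell
# Hypothetical helper: scrape each URL to its own numbered file.
# Prints the path of every file that was written successfully.
# Usage: scrape_each <url1> <url2> ...
scrape_each() {
  se_n=0
  se_failed=0
  for se_url in "$@"; do
    se_n=$((se_n + 1))
    # SCRAPE_DIR is an assumed override; the default matches the /tmp paths above.
    se_out="${SCRAPE_DIR:-/tmp}/scrape-$se_n.md"
    rm -f "$se_out"
    if bunx firecrawl-cli scrape "$se_url" -o "$se_out" && [ -s "$se_out" ]; then
      echo "$se_out"
    else
      echo "failed: $se_url" >&2
      se_failed=$((se_failed + 1))
    fi
  done
  # Succeed only if every URL produced a non-empty file.
  [ "$se_failed" -eq 0 ]
}
```

Calling `scrape_each "<url1>" "<url2>"` lists the files that are safe to open with the Read tool, while failed URLs are reported on stderr so they can be carried into the gap report.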
The CLI is beta (released Jan 2026) -- expect quirks and flag changes. Run bunx firecrawl-cli scrape --help for current options.
If both WebFetch and Firecrawl fail, go to Step 4 (Report Gaps).