By tpereyral
Automated web scraping pipeline — analyze sites, generate scripts, execute, validate, and self-correct.
An automated web scraping pipeline packaged as a Claude Code plugin. Provide a URL, describe what you want, and the system analyzes the site, generates a tailored Python script, executes it, validates results, and self-corrects through feedback loops.
git clone <repo-url> web-scraper-plugin
cd web-scraper-plugin
claude plugin marketplace add .
claude plugin install web-scraper
This installs the plugin at user scope. Claude Code caches all plugin files (agent prompts, templates, schemas), so the skill works from any project directory — you don't need to stay in this repo.
The /web-scraper skill will appear in your skills list after restarting the session.
If you just want to try it without installing:
claude --plugin-dir /path/to/web-scraper-plugin
Once installed, invoke the skill from any project:
/web-scraper https://example.com/products — extract all product images and descriptions
Or just describe what you want:
Scrape all article text and figures from https://example.com/blog
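The pipeline then runs automatically: it analyzes the site, generates a tailored Python script, executes it, validates the results, and retries with corrections when validation fails.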
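The generate, execute, validate, and self-correct cycle can be pictured as a bounded retry loop. The following is only a minimal sketch, not the plugin's actual orchestrator: `run_pipeline` and its three callables are hypothetical names, and the cap of three attempts is an assumption taken from the "3 iterations" mentioned under troubleshooting.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 3  # assumption: matches the "3 iterations" troubleshooting note


def run_pipeline(
    generate: Callable[[Optional[str]], str],
    execute: Callable[[str], dict],
    validate: Callable[[dict], Optional[str]],
    max_iterations: int = MAX_ITERATIONS,
) -> dict:
    """Generate, execute, and validate until validation passes or attempts run out.

    All three callables are hypothetical stand-ins. `validate` returns None on
    success, or a feedback string describing what went wrong.
    """
    feedback: Optional[str] = None
    result: dict = {}
    for attempt in range(1, max_iterations + 1):
        script = generate(feedback)  # feedback is None on the first attempt
        result = execute(script)
        feedback = validate(result)
        if feedback is None:
            return {"status": "ok", "attempts": attempt, "result": result}
    # Validation never passed: report the last attempt as a partial result.
    return {
        "status": "partial",
        "attempts": max_iterations,
        "result": result,
        "feedback": feedback,
    }
```

Passing the validator's feedback back into the generator is what makes each retry a correction rather than a plain re-run.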
Results appear in ./scrape_output/{domain}/ in your current working directory.
scrape_output/
  example.com/
    images/        # Downloaded images at original quality
    text/          # Extracted text as individual .txt files
    results.json   # Manifest with metadata, paths, and errors
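To inspect a finished run from your own code, something like the sketch below may help. It relies only on the layout shown above (`images/`, `text/`, `results.json`); the helper name `summarize_output` is hypothetical, and because the manifest's schema is not described here, the file is loaded without assuming any particular keys.

```python
import json
from pathlib import Path


def summarize_output(domain_dir: str) -> dict:
    """Count downloaded files and load the manifest for one scraped domain."""
    root = Path(domain_dir)
    images_dir = root / "images"
    text_dir = root / "text"
    manifest_path = root / "results.json"

    image_count = 0
    if images_dir.is_dir():
        image_count = sum(1 for p in images_dir.iterdir() if p.is_file())

    text_count = 0
    if text_dir.is_dir():
        text_count = sum(1 for p in text_dir.glob("*.txt"))

    # The manifest schema is not documented here, so it is returned as-is.
    manifest = None
    if manifest_path.is_file():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    return {"image_count": image_count, "text_count": text_count, "manifest": manifest}
```

For example, `summarize_output("scrape_output/example.com")` returns the two counts plus the parsed manifest, or `None` for the manifest if the file is missing.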
| Type | Strategy | How it works |
|---|---|---|
| Static HTML | static_html | requests + BeautifulSoup |
| Paginated static | static_html_paginated | Follows next-page links |
| JS-rendered | js_rendered | Playwright headless browser |
| Infinite scroll | js_rendered_scroll | Playwright + scroll detection |
| API-backed | api_direct | Direct JSON API calls |
| Mixed | hybrid | Static first, Playwright fallback |
The system automatically detects which strategy to use.
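As an illustration of the paginated strategy's "follows next-page links" step, here is a minimal sketch. The table lists requests + BeautifulSoup for static pages; this version swaps in the standard library's `html.parser` so it runs without extra dependencies. It also assumes the site marks its next page with `rel="next"`, which many sites do not. `NextLinkParser` and `find_next_page` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> or <link> tag marked rel="next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, e.g. rel="next nofollow".
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # Resolve relative links against the page they were found on.
    return urljoin(current_url, parser.next_href)
```

A crawl loop would call `find_next_page` after each fetch and stop when it returns `None` or when the page limit is reached.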
The pipeline respects these settings (configurable per run):
| Setting | Default | Description |
|---|---|---|
| Rate limit | 1000ms | Delay between requests |
| Max pages | 10 | Maximum pages to scrape |
| Request timeout | 30s | Per-request timeout |
| Image quality | Original | Downloads full-resolution images |
| Script timeout | 5 min | Maximum script execution time |
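The defaults in the table could be modeled roughly as follows. This is a sketch under assumptions: `ScrapeSettings`, its field names, and `fetch_pages` are hypothetical rather than the plugin's real configuration interface. The image-quality setting is left out, and the script timeout is recorded but not enforced here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ScrapeSettings:
    """Hypothetical container for the per-run defaults listed in the table."""

    rate_limit_ms: int = 1000      # delay between requests
    max_pages: int = 10            # maximum pages to scrape
    request_timeout_s: int = 30    # per-request timeout
    script_timeout_s: int = 300    # 5 minutes; not enforced in this sketch


def fetch_pages(
    urls: Iterable[str],
    fetch: Callable[[str, int], object],
    settings: ScrapeSettings = ScrapeSettings(),
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Fetch at most max_pages URLs, pausing rate_limit_ms between requests."""
    results: List[object] = []
    for index, url in enumerate(urls):
        if index >= settings.max_pages:
            break
        if index > 0:
            sleep(settings.rate_limit_ms / 1000)  # no delay before the first request
        results.append(fetch(url, settings.request_timeout_s))
    return results
```

Injecting `fetch` and `sleep` keeps the sketch independent of any HTTP library and lets the pacing logic be exercised without real waits.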
.claude-plugin/
  plugin.json            # Plugin manifest
  marketplace.json       # Marketplace metadata
skills/
  web-scraper/
    SKILL.md             # Orchestrator skill
    references/          # Agent prompts + playbook
    schemas/             # JSON Schema data contracts
    assets/templates/    # Python script templates
docs/
  prd-web-scraper-system.md
  Web Scraping Pipeline.pdf
Playwright not installed: The system installs it automatically. If that fails, install it manually:
pip install playwright && python -m playwright install chromium
Rate limiting / 429 errors: Increase the delay between requests (the "Rate limit" setting), for example: "Scrape with a 2 second delay between requests"
Authentication required: Tell Claude: "The site requires login." It will ask for credentials and pass them to the script via environment variables.
Partial results after 3 iterations: The site may have an unusual structure. Check ./scrape_runs/ for run logs with details on what failed.
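For the authentication case, a generated script would read the credentials from its environment rather than from hard-coded values. Here is a minimal sketch of that pattern. The variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` and the helper `load_credentials` are assumptions made for illustration; the names the plugin actually uses are not documented here.

```python
import os
from typing import Mapping, Optional, Tuple

USERNAME_VAR = "SCRAPER_USERNAME"  # hypothetical variable name
PASSWORD_VAR = "SCRAPER_PASSWORD"  # hypothetical variable name


def load_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Read login credentials from environment variables.

    Raises RuntimeError naming every missing or empty variable, so the
    failure is explicit instead of surfacing later as a rejected login.
    """
    env = os.environ if env is None else env
    missing = [name for name in (USERNAME_VAR, PASSWORD_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[USERNAME_VAR], env[PASSWORD_VAR]
```

Keeping credentials in the environment means they never appear in the generated script or in the run logs.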
npx claudepluginhub tpereyral/web-scraping-system --plugin web-scraper