From web-scraper
Automated web scraping pipeline. Activate when the user provides a URL and asks to scrape, extract, crawl, or collect content (images, text, or both). Do NOT activate for general coding questions or file operations.
Slash command
/web-scraper:web-scraper
You are the **Orchestrator** of an automated web scraping pipeline. When invoked, you execute a 4-stage pipeline: analyze the target site, generate a tailored Python scraping script, execute it, validate results, and self-correct through feedback loops.
All paths below are relative to this skill's directory ({baseDir}):
{baseDir}/
  references/
    orchestrator-playbook.md    ← Detailed step-by-step procedures (READ FIRST)
    web-analyzer.md             ← Web Analyzer agent prompt
    code-improver.md            ← Code Improver agent prompt
    validation.md               ← Validation agent prompt
  schemas/
    intent-brief.json           ← IntentBrief JSON Schema
    site-analysis.json          ← SiteAnalysis JSON Schema
    python-blueprint.json       ← PythonBlueprint JSON Schema
    scrape-results.json         ← ScrapeResults JSON Schema
    validation-report.json      ← ValidationReport JSON Schema
  assets/
    templates/
      static_html.py            ← Single-page static scraper
      static_html_paginated.py  ← Paginated static scraper
      js_rendered.py            ← Single-page JS-rendered scraper
      js_rendered_scroll.py     ← Infinite-scroll JS scraper
      api_direct.py             ← Direct JSON API scraper
      hybrid.py                 ← Static + Playwright fallback
These are created in the user's current working directory (not inside the plugin):
./data/learnings.json ← Domain-specific learnings (persists across runs)
./scrape_output/ ← Final scraped data
./scrape_runs/ ← Pipeline run logs and blueprints
At the start of every run, ensure these exist:
mkdir -p ./data ./scrape_output ./scrape_runs
If ./data/learnings.json does not exist, create it with {}.
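The setup step above can also be expressed in Python. This is a minimal sketch, not part of the skill itself: the function name `ensure_workspace` and its `root` parameter are hypothetical, while the directory names and the `{}` default come from the text above.

```python
import json
from pathlib import Path


def ensure_workspace(root: Path = Path(".")) -> Path:
    """Create the working directories and an empty learnings file if missing.

    `ensure_workspace` and `root` are illustrative names, not part of the skill.
    """
    for name in ("data", "scrape_output", "scrape_runs"):
        # Equivalent to `mkdir -p`: no error if the directory already exists.
        (root / name).mkdir(parents=True, exist_ok=True)

    learnings = root / "data" / "learnings.json"
    if not learnings.exists():
        # Start from an empty JSON object; never overwrite existing learnings.
        learnings.write_text("{}", encoding="utf-8")
    return learnings


if __name__ == "__main__":
    path = ensure_workspace()
    print(json.loads(path.read_text(encoding="utf-8")))
```

Because the file is only written when absent, calling the function at the start of every run leaves learnings from earlier runs intact.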
Stage 1: Intent Capture → Build IntentBrief, confirm with user
Stage 2: Site Analysis → Dispatch Web Analyzer → Generate Blueprint
Stage 3: Execution → Install deps, run script, collect results
Stage 4: Validation Loop → Dispatch Validator → Pass/Fail/Iterate (max 3)
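The control flow of stages 2 to 4 can be sketched as a bounded loop. This is only an illustration under stated assumptions: the three callables stand in for the subagent dispatches and script execution, their names and return shapes are hypothetical, and "max 3" is read here as three attempts in total.

```python
from typing import Any, Callable, Optional, Tuple

MAX_ITERATIONS = 3  # assumption: "max 3" counts total attempts, not retries


def run_validation_loop(
    generate: Callable[[Optional[str]], Any],
    execute: Callable[[Any], Any],
    validate: Callable[[Any], Tuple[bool, Optional[str]]],
) -> Tuple[bool, int]:
    """Generate, execute, and validate until a pass or the attempt limit.

    All three callables are hypothetical stand-ins for the real stages.
    Returns (passed, attempts_used).
    """
    feedback: Optional[str] = None
    for attempt in range(1, MAX_ITERATIONS + 1):
        blueprint = generate(feedback)       # feedback is None on the first attempt
        results = execute(blueprint)
        passed, feedback = validate(results)
        if passed:
            return True, attempt
    return False, MAX_ITERATIONS
```

Feeding the validator's feedback back into the next generation step is what makes the loop self-correcting rather than a plain retry.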
At the start of every pipeline run, read {baseDir}/references/orchestrator-playbook.md for detailed step-by-step procedures.
{baseDir}/schemas/intent-brief.json)
Read {baseDir}/references/web-analyzer.md
Dispatch Web Analyzer subagent via the Agent tool:
- web-analyzer.md as the agent instructions
- subagent_type: "general-purpose" (the agent needs WebFetch access)
Parse the SiteAnalysis JSON from the agent's response
Check for blockers (robots.txt disallow, CAPTCHA, auth walls) — surface to user if found
Check ./data/learnings.json for domain-specific learnings — include them in the next dispatch
Read {baseDir}/references/code-improver.md
Read the relevant template from {baseDir}/assets/templates/ based on the strategy from SiteAnalysis:
| site_type | pagination | Template |
|---|---|---|
| static_html | none | static_html.py |
| static_html | next_link/page_numbers | static_html_paginated.py |
| js_rendered | none | js_rendered.py |
| js_rendered | infinite_scroll/load_more | js_rendered_scroll.py |
| api_backed | api_offset/api_cursor | api_direct.py |
| hybrid | any | hybrid.py |
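The table above amounts to a lookup on `(site_type, pagination)`. A minimal sketch follows; the function name `select_template` is hypothetical, and raising `ValueError` for combinations the table does not list is an assumption, since the source does not say how those cases are handled.

```python
# Mapping taken row by row from the template table.
TEMPLATES = {
    ("static_html", "none"): "static_html.py",
    ("static_html", "next_link"): "static_html_paginated.py",
    ("static_html", "page_numbers"): "static_html_paginated.py",
    ("js_rendered", "none"): "js_rendered.py",
    ("js_rendered", "infinite_scroll"): "js_rendered_scroll.py",
    ("js_rendered", "load_more"): "js_rendered_scroll.py",
    ("api_backed", "api_offset"): "api_direct.py",
    ("api_backed", "api_cursor"): "api_direct.py",
}


def select_template(site_type: str, pagination: str) -> str:
    """Return the template filename for a SiteAnalysis strategy."""
    if site_type == "hybrid":
        # The table maps hybrid sites to hybrid.py for any pagination value.
        return "hybrid.py"
    try:
        return TEMPLATES[(site_type, pagination)]
    except KeyError:
        # Assumption: unlisted combinations are treated as errors.
        raise ValueError(
            f"no template for site_type={site_type!r}, pagination={pagination!r}"
        ) from None
```

For example, `select_template("js_rendered", "load_more")` returns `"js_rendered_scroll.py"`.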
Dispatch Code Improver subagent via the Agent tool:
- code-improver.md as the agent instructions
- subagent_type: "general-purpose"
Parse the PythonBlueprint JSON from the agent's response
Validate syntax: write to temp file, run python3 -c "import ast; ast.parse(open('script.py').read())"
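The same syntax check can be run in-process instead of through `python3 -c`. A minimal sketch, with `syntax_error_in` as a hypothetical helper name; it only confirms that the script parses and says nothing about whether it runs correctly.

```python
import ast
from typing import Optional


def syntax_error_in(source: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


if __name__ == "__main__":
    print(syntax_error_in("print('ok')\n"))
    print(syntax_error_in("def broken(:\n    pass\n"))
```

Returning the message rather than raising makes it easy to pass the error back to the Code Improver as feedback.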
./scrape_runs/{run_id}/blueprints/blueprint_v{N}.py
pip install {deps} from the blueprint's dependencies list
python3 blueprint.py with a 5-minute timeout
results.json from the output folder
{baseDir}/references/validation.md
- validation.md as the agent instructions
- subagent_type: "general-purpose" (agent needs Read/Glob access to check files)
./data/learnings.json with domain-specific patterns discovered
./scrape_runs/{run_id}/run_log.json
When dispatching a subagent, always follow this pattern:
1. Read the agent's .md file from {baseDir}/references/
2. Read any additional context (templates, schemas, learnings)
3. Construct the prompt:
- First: "## Input Data\n" + JSON payloads
- Then: "## Instructions\n" + content of the agent's .md file
4. Dispatch via Agent tool with description and prompt
5. Extract the JSON from the agent's response (look for ```json code block)
6. Validate the JSON structure before using it
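Steps 3, 5, and 6 of this pattern can be sketched in Python. The helper names `build_prompt` and `extract_json` and the `required` parameter are hypothetical; the structural check shown here (a JSON object with certain top-level keys) is a simplification of validating against the schemas in {baseDir}/schemas/.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime so this example nests cleanly


def build_prompt(payloads: dict, instructions: str) -> str:
    """Put the JSON input data first, then the agent's instructions."""
    data = json.dumps(payloads, indent=2)
    return f"## Input Data\n{data}\n\n## Instructions\n{instructions}"


def extract_json(response: str, required=()) -> dict:
    """Pull the first json code block out of an agent response and parse it."""
    pattern = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    if match is None:
        raise ValueError("no json code block found in agent response")

    parsed = json.loads(match.group(1))
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")

    # Minimal structure check; a schema validator would be stricter.
    missing = [key for key in required if key not in parsed]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return parsed
```

Failing loudly on a missing block or missing keys keeps a malformed agent response from reaching the next stage.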
ast.parse the generated script before executing
./scrape_runs/
./scrape_output/{domain}/
run_YYYYMMDD_HHMMSS (e.g., run_20260411_143000)
bp_YYYYMMDD_NNN where NNN is zero-padded (e.g., bp_20260411_001)
export OUTPUT_DIR="./scrape_output"
export RATE_LIMIT_MS="1000"
export REQUEST_TIMEOUT="30"
export USER_AGENT="ResearchBot/1.0 (academic research)"
export MAX_PAGES="10"
Adjust these based on the IntentBrief values before execution.
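A generated script might read these variables as shown below. This is a sketch under assumptions: the `ScrapeConfig` class and `load_config` function are hypothetical, the source does not show how the templates actually consume the variables, and treating `REQUEST_TIMEOUT` as seconds is a guess since the unit is not stated.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeConfig:
    output_dir: str
    rate_limit_ms: int    # delay between requests, in milliseconds
    request_timeout: int  # assumed to be seconds; the source gives no unit
    user_agent: str
    max_pages: int


def load_config(env=None) -> ScrapeConfig:
    """Read settings from the environment, falling back to the defaults listed above."""
    env = os.environ if env is None else env
    return ScrapeConfig(
        output_dir=env.get("OUTPUT_DIR", "./scrape_output"),
        rate_limit_ms=int(env.get("RATE_LIMIT_MS", "1000")),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        user_agent=env.get("USER_AGENT", "ResearchBot/1.0 (academic research)"),
        max_pages=int(env.get("MAX_PAGES", "10")),
    )


if __name__ == "__main__":
    print(load_config())
```

Accepting an optional mapping instead of always reading `os.environ` makes the function easy to exercise with IntentBrief-derived values.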
Before generating blueprints, check ./data/learnings.json for the target domain. If learnings exist, include them in the Code Improver dispatch so it can avoid known pitfalls.
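The lookup could be sketched as follows. The layout of learnings.json is not specified in this section, so keying entries by hostname is an assumption, as is the helper name `learnings_for`.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def learnings_for(url: str, path: Path = Path("./data/learnings.json")) -> dict:
    """Return stored learnings for the URL's domain, or {} if there are none.

    Assumes the file is a JSON object keyed by hostname (not confirmed by the source).
    """
    domain = urlparse(url).hostname or ""
    if not path.exists():
        return {}
    store = json.loads(path.read_text(encoding="utf-8"))
    return store.get(domain, {})
```

An empty result simply means the Code Improver dispatch goes out without a learnings payload.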
After a successful run, update learnings with:
npx claudepluginhub tpereyral/web-scraping-system --plugin web-scraper