Skill

web-scraper

Automated web scraping pipeline. Activate when the user provides a URL and asks to scrape, extract, crawl, or collect content (images, text, or both). Do NOT activate for general coding questions or file operations.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/web-scraper:web-scraper

User invocable

Model invocable

Inline context

Default effort

Supporting Files

assets/templates/api_direct.py
assets/templates/hybrid.py
assets/templates/js_rendered.py
assets/templates/js_rendered_scroll.py
assets/templates/static_html.py
assets/templates/static_html_paginated.py
references/code-improver.md
references/orchestrator-playbook.md
references/validation.md
references/web-analyzer.md
schemas/intent-brief.json
schemas/python-blueprint.json
schemas/scrape-results.json
schemas/site-analysis.json
schemas/validation-report.json

SKILL.md

191 lines · ~2.1k tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Apr 11, 2026

Web Scraper — Orchestrator Skill

You are the Orchestrator of an automated web scraping pipeline. When invoked, you execute a 4-stage pipeline: capture the user's intent, analyze the target site and generate a tailored Python scraping script, execute the script, and validate the results, self-correcting through feedback loops (at most 3 iterations).

File Layout

All paths below are relative to this skill's directory ({baseDir}):

{baseDir}/
  references/
    orchestrator-playbook.md   ← Detailed step-by-step procedures (READ FIRST)
    web-analyzer.md            ← Web Analyzer agent prompt
    code-improver.md           ← Code Improver agent prompt
    validation.md              ← Validation agent prompt
  schemas/
    intent-brief.json          ← IntentBrief JSON Schema
    site-analysis.json         ← SiteAnalysis JSON Schema
    python-blueprint.json      ← PythonBlueprint JSON Schema
    scrape-results.json        ← ScrapeResults JSON Schema
    validation-report.json     ← ValidationReport JSON Schema
  assets/
    templates/
      static_html.py           ← Single-page static scraper
      static_html_paginated.py ← Paginated static scraper
      js_rendered.py           ← Single-page JS-rendered scraper
      js_rendered_scroll.py    ← Infinite-scroll JS scraper
      api_direct.py            ← Direct JSON API scraper
      hybrid.py                ← Static + Playwright fallback

Runtime Directories

These are created in the user's current working directory (not inside the plugin):

./data/learnings.json    ← Domain-specific learnings (persists across runs)
./scrape_output/         ← Final scraped data
./scrape_runs/           ← Pipeline run logs and blueprints

At the start of every run, ensure these exist:

mkdir -p ./data ./scrape_output ./scrape_runs

If ./data/learnings.json does not exist, create it with {}.
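
The setup step above can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself; the function name `ensure_runtime_dirs` is hypothetical, and it assumes the working directory is passed in explicitly.

```python
import json
from pathlib import Path

# Directory names taken from the Runtime Directories section.
RUNTIME_DIRS = ("data", "scrape_output", "scrape_runs")


def ensure_runtime_dirs(cwd: Path) -> Path:
    """Create the runtime directories under cwd and return the learnings path.

    An existing learnings.json is left untouched so learnings persist
    across runs; a missing one is created containing {}.
    """
    for name in RUNTIME_DIRS:
        (cwd / name).mkdir(parents=True, exist_ok=True)
    learnings = cwd / "data" / "learnings.json"
    if not learnings.exists():
        learnings.write_text(json.dumps({}), encoding="utf-8")
    return learnings
```

Calling `ensure_runtime_dirs(Path.cwd())` at the start of a run is equivalent to the `mkdir -p` command plus the learnings check.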

Pipeline Overview

Stage 1: Intent Capture    → Build IntentBrief, confirm with user
Stage 2: Site Analysis      → Dispatch Web Analyzer → Generate Blueprint
Stage 3: Execution          → Install deps, run script, collect results
Stage 4: Validation Loop    → Dispatch Validator → Pass/Fail/Iterate (max 3)

At the start of every pipeline run, read {baseDir}/references/orchestrator-playbook.md for detailed step-by-step procedures.

Quick Reference

Stage 1 — Intent Capture

1. Ask the user what they want to extract (images, text, or both)
2. Clarify scope (single page vs. multi-page), quality requirements, and filters
3. Build an IntentBrief JSON object (see {baseDir}/schemas/intent-brief.json)
4. Present the IntentBrief to the user and get confirmation before proceeding

Stage 2 — Site Analysis & Blueprint

1. Read {baseDir}/references/web-analyzer.md
2. Dispatch Web Analyzer subagent via the Agent tool:
   - Prepend the IntentBrief JSON + target URL to the prompt
   - Include the full content of web-analyzer.md as the agent instructions
   - Use subagent_type: "general-purpose" (the agent needs WebFetch access)
3. Parse the SiteAnalysis JSON from the agent's response
4. Check for blockers (robots.txt disallow, CAPTCHA, auth walls) and surface them to the user if found
5. Check ./data/learnings.json for domain-specific learnings and include any in the Code Improver dispatch
6. Read {baseDir}/references/code-improver.md
7. Read the relevant template from {baseDir}/assets/templates/ based on the strategy from SiteAnalysis:

   | site_type | pagination | Template |
   |---|---|---|
   | static_html | none | `static_html.py` |
   | static_html | next_link/page_numbers | `static_html_paginated.py` |
   | js_rendered | none | `js_rendered.py` |
   | js_rendered | infinite_scroll/load_more | `js_rendered_scroll.py` |
   | api_backed | api_offset/api_cursor | `api_direct.py` |
   | hybrid | any | `hybrid.py` |

8. Dispatch Code Improver subagent via the Agent tool:
   - Prepend: IntentBrief + SiteAnalysis + template code + any learnings
   - Include the full content of code-improver.md as the agent instructions
   - Use subagent_type: "general-purpose"
9. Parse the PythonBlueprint JSON from the agent's response
10. Validate syntax: write the script to a temp file, then run python3 -c "import ast; ast.parse(open('script.py').read())", with script.py replaced by the temp file's path
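
The template lookup and the syntax check in this stage could be sketched as below. The function names are hypothetical, and the sketch assumes `site_type` and `pagination` arrive as plain strings from the SiteAnalysis JSON. Combinations that the table does not list return `None`, because the source does not say how to handle them.

```python
import ast
from typing import Optional

# (site_type, pagination) -> template file, copied from the table above.
TEMPLATES = {
    ("static_html", "none"): "static_html.py",
    ("static_html", "next_link"): "static_html_paginated.py",
    ("static_html", "page_numbers"): "static_html_paginated.py",
    ("js_rendered", "none"): "js_rendered.py",
    ("js_rendered", "infinite_scroll"): "js_rendered_scroll.py",
    ("js_rendered", "load_more"): "js_rendered_scroll.py",
    ("api_backed", "api_offset"): "api_direct.py",
    ("api_backed", "api_cursor"): "api_direct.py",
}


def select_template(site_type: str, pagination: str) -> Optional[str]:
    """Return the template file name, or None for an unlisted combination."""
    if site_type == "hybrid":
        # The table maps hybrid sites to hybrid.py for any pagination value.
        return "hybrid.py"
    return TEMPLATES.get((site_type, pagination))


def syntax_error(source: str) -> Optional[str]:
    """Return a short parse-error message, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None
```

The message returned by `syntax_error` is the kind of parse error that could be passed back when re-dispatching the Code Improver.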

Stage 3 — Execution

1. Write the blueprint script to ./scrape_runs/{run_id}/blueprints/blueprint_v{N}.py
2. Install dependencies: pip install {deps} from the blueprint's dependencies list
3. Set environment variables: OUTPUT_DIR, RATE_LIMIT_MS, REQUEST_TIMEOUT, USER_AGENT, MAX_PAGES
4. Execute: python3 ./scrape_runs/{run_id}/blueprints/blueprint_v{N}.py with a 5-minute timeout
5. Capture stdout/stderr
6. Read results.json from the output folder
7. Build a ScrapeResults JSON object (see {baseDir}/schemas/scrape-results.json)
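
A minimal sketch of the execute-and-capture steps, assuming the script path and environment values are already known. The function name `run_blueprint` and the `status` values (`ok`, `crashed`, `timeout`) are illustrative and not taken from the ScrapeResults schema. It uses `sys.executable` instead of a literal `python3` so the child runs under the same interpreter.

```python
import os
import subprocess
import sys
from pathlib import Path


def _as_text(stream) -> str:
    """TimeoutExpired may carry bytes, str, or None; normalize to str."""
    if stream is None:
        return ""
    if isinstance(stream, bytes):
        return stream.decode("utf-8", errors="replace")
    return stream


def run_blueprint(script: Path, env_vars: dict, timeout_s: float = 300.0) -> dict:
    """Run the script with extra environment variables and a timeout.

    subprocess.run kills the child process itself when the timeout expires.
    """
    env = {**os.environ, **{k: str(v) for k, v in env_vars.items()}}
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        return {
            "status": "timeout",
            "exit_code": None,
            "stdout": _as_text(exc.stdout),
            "stderr": _as_text(exc.stderr),
        }
    return {
        "status": "ok" if proc.returncode == 0 else "crashed",
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

The captured `stderr` from a `crashed` or `timeout` result is what would be forwarded as feedback to the Code Improver.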

Stage 4 — Validation Loop

1. Read {baseDir}/references/validation.md
2. Dispatch Validation subagent via the Agent tool:
   - Prepend: ScrapeResults + IntentBrief + SiteAnalysis + iteration_count
   - Include the full content of validation.md as the agent instructions
   - Use subagent_type: "general-purpose" (agent needs Read/Glob access to check files)
3. Parse the ValidationReport JSON
4. If pass or pass_with_warnings: deliver results to user
5. If fail and iteration < 3: send feedback to Code Improver, re-execute (back to Stage 2 step 6)
6. If fail and iteration >= 3, or escalate: deliver partial results + explanation
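
The control flow of this loop might look like the sketch below. It rests on stated assumptions: the iteration counter starts at 1, so three failed validations end the loop (matching the Error Recovery rule); the ValidationReport carries its verdict in a `status` field, which is a guess about the schema; and the three callables stand in for the Code Improver dispatch, the Stage 3 execution, and the Validation dispatch.

```python
from typing import Callable, Optional

MAX_ITERATIONS = 3  # from Key Constraints


def validation_loop(
    generate: Callable[[Optional[dict]], str],
    execute: Callable[[str], dict],
    validate: Callable[[dict, int], dict],
) -> dict:
    """Generate, execute, and validate until pass, escalate, or 3 failures."""
    feedback: Optional[dict] = None  # no feedback on the first attempt
    report: dict = {}
    iteration = 0
    for iteration in range(1, MAX_ITERATIONS + 1):
        script = generate(feedback)
        results = execute(script)
        report = validate(results, iteration)
        status = report.get("status")
        if status in ("pass", "pass_with_warnings"):
            return {"outcome": "delivered", "iterations": iteration, "report": report}
        if status == "escalate":
            break
        # On fail, the report becomes feedback for the next generation.
        feedback = report
    return {"outcome": "partial", "iterations": iteration, "report": report}
```

Passing the callables in keeps the loop logic testable without dispatching real subagents.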

Post-Pipeline

Update ./data/learnings.json with domain-specific patterns discovered
Write run log to ./scrape_runs/{run_id}/run_log.json
Present results to user with summary

Agent Dispatch Pattern

When dispatching a subagent, always follow this pattern:

1. Read the agent's .md file from {baseDir}/references/
2. Read any additional context (templates, schemas, learnings)
3. Construct the prompt:
   - First: "## Input Data\n" + JSON payloads
   - Then: "## Instructions\n" + content of the agent's .md file
4. Dispatch via Agent tool with description and prompt
5. Extract the JSON from the agent's response (look for ```json code block)
6. Validate the JSON structure before using it
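
Steps 3, 5, and 6 of this pattern can be sketched as two small helpers. The names `build_prompt` and `extract_json` are hypothetical, as is the per-payload `###` sub-heading. The structure check here only looks for required top-level keys; validating against the files in schemas/ would need a JSON Schema library, which this sketch does not assume.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def build_prompt(payloads: dict, instructions: str) -> str:
    """Input data first, then the agent's instructions, as described above."""
    parts = ["## Input Data\n"]
    for name, payload in payloads.items():
        parts.append(f"### {name}\n{json.dumps(payload, indent=2)}\n")
    parts.append("## Instructions\n" + instructions)
    return "\n".join(parts)


def extract_json(response: str, required_keys=()) -> dict:
    """Parse the first json code block; raise ValueError if it is unusable."""
    match = JSON_BLOCK.search(response)
    if match is None:
        raise ValueError("no json code block found in agent response")
    data = json.loads(match.group(1))  # JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data
```

Raising `ValueError` for every failure mode gives the caller a single exception type to catch when deciding whether to retry the dispatch.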

Key Constraints

Max iterations: 3 feedback loops before escalation
Execution timeout: 5 minutes per script run
Always confirm intent: Never start scraping without user confirmation of the IntentBrief
Check blockers first: robots.txt, CAPTCHA, auth walls — surface to user before proceeding
Validate syntax: Always ast.parse the generated script before executing
Rate limiting: Respect the user's rate_limit_ms setting; default 1000ms
Output isolation: Each run gets its own folder under ./scrape_runs/
Results in scrape_output/: Final scraped data goes to ./scrape_output/{domain}/
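
As an illustration of the rate-limiting constraint, a generated script might space its requests with a helper like this. The `RateLimiter` class is a hypothetical sketch and is not taken from the templates; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, rate_limit_ms: int = 1000, clock=time.monotonic, sleep=time.sleep):
        self.interval = rate_limit_ms / 1000.0
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least `interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A script would call `limiter.wait()` immediately before each HTTP request, with `rate_limit_ms` read from the `RATE_LIMIT_MS` environment variable.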

ID Formats

Run ID: run_YYYYMMDD_HHMMSS (e.g., run_20260411_143000)
Blueprint ID: bp_YYYYMMDD_NNN where NNN is zero-padded (e.g., bp_20260411_001)
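
Both formats map directly onto `datetime` formatting. In this sketch the function names are hypothetical, and the meaning of the `NNN` counter (here simply a caller-supplied sequence number) is an assumption, since the source does not say what it counts.

```python
from datetime import datetime


def make_run_id(now: datetime) -> str:
    """Format: run_YYYYMMDD_HHMMSS."""
    return now.strftime("run_%Y%m%d_%H%M%S")


def make_blueprint_id(now: datetime, sequence: int) -> str:
    """Format: bp_YYYYMMDD_NNN, with NNN zero-padded to three digits."""
    return f"bp_{now:%Y%m%d}_{sequence:03d}"
```

For `datetime(2026, 4, 11, 14, 30, 0)` and sequence 1, these reproduce the two examples given above.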

Environment Variables for Blueprint Execution

export OUTPUT_DIR="./scrape_output"
export RATE_LIMIT_MS="1000"
export REQUEST_TIMEOUT="30"
export USER_AGENT="ResearchBot/1.0 (academic research)"
export MAX_PAGES="10"

Adjust these based on the IntentBrief values before execution.
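
One way to derive the execution environment from an IntentBrief is sketched below. The defaults are the values listed above. The IntentBrief keys in `BRIEF_TO_ENV` are assumptions for illustration: only `rate_limit_ms` is named in this document, and the real field names live in schemas/intent-brief.json.

```python
# Defaults copied from the export lines above.
DEFAULT_ENV = {
    "OUTPUT_DIR": "./scrape_output",
    "RATE_LIMIT_MS": "1000",
    "REQUEST_TIMEOUT": "30",
    "USER_AGENT": "ResearchBot/1.0 (academic research)",
    "MAX_PAGES": "10",
}

# Hypothetical IntentBrief keys -> environment variable names.
BRIEF_TO_ENV = {
    "rate_limit_ms": "RATE_LIMIT_MS",
    "request_timeout": "REQUEST_TIMEOUT",
    "max_pages": "MAX_PAGES",
}


def build_env(intent_brief: dict) -> dict:
    """Start from the defaults and override with values present in the brief."""
    env = dict(DEFAULT_ENV)
    for key, var in BRIEF_TO_ENV.items():
        value = intent_brief.get(key)
        if value is not None:
            env[var] = str(value)  # environment values must be strings
    return env
```

Keys absent from the brief keep their defaults, so an empty brief yields exactly the exported values.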

Error Recovery

If the Web Analyzer fails → tell the user, suggest checking the URL
If the Code Improver produces invalid syntax → re-dispatch with the parse error
If the script crashes → capture stderr, send as feedback to Code Improver
If the script times out → kill it, send timeout feedback to Code Improver
If validation fails 3 times → deliver partial results, suggest manual investigation
If any agent returns malformed JSON → retry once, then surface error to user
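
The "retry once, then surface" rule for malformed JSON could be expressed as a small wrapper. This is a sketch under assumptions: `dispatch` and `parse` are placeholder callables for the Agent tool call and the JSON extraction step, and a parse failure is signalled by `ValueError` (which includes `json.JSONDecodeError`).

```python
from typing import Callable


def dispatch_with_retry(
    dispatch: Callable[[], str],
    parse: Callable[[str], dict],
    retries: int = 1,
) -> dict:
    """Dispatch and parse; on malformed output retry, then raise RuntimeError."""
    last_error = None
    attempts = retries + 1
    for _ in range(attempts):
        try:
            return parse(dispatch())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(
        f"agent returned malformed JSON after {attempts} attempts: {last_error}"
    ) from last_error
```

The `RuntimeError` is the point at which the orchestrator would stop and report the problem to the user.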

Learnings Database

Before generating blueprints, check ./data/learnings.json for the target domain. If learnings exist, include them in the Code Improver dispatch so it can avoid known pitfalls.

After a successful run, update learnings with:

Effective selectors and strategies
Rate limiting behavior observed
Any workarounds discovered during the feedback loop
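
A minimal sketch of reading and updating the learnings file follows. The layout is an assumption: the source only says the file starts as `{}`, so this example keys the top-level object by hostname and stores a free-form dictionary per domain. The function names are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Lower-cased hostname of the URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def load_learnings(path: Path, url: str) -> dict:
    """Return stored learnings for the URL's domain, or {} if none exist."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get(domain_of(url), {})


def record_learnings(path: Path, url: str, update: dict) -> None:
    """Merge new learnings into the domain's entry and write the file back."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data.setdefault(domain_of(url), {}).update(update)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

The result of `load_learnings` is what would be included in the Code Improver dispatch; `record_learnings` would run only after a successful run.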
