Web Scraper Plugin for Claude Code

An automated web scraping pipeline packaged as a Claude Code plugin. Provide a URL, describe what you want, and the system analyzes the site, generates a tailored Python script, executes it, validates results, and self-corrects through feedback loops.

Installation

1. Clone the repo

git clone <repo-url> web-scraper-plugin
cd web-scraper-plugin

2. Register the marketplace and install the plugin

claude plugin marketplace add .
claude plugin install web-scraper

This installs the plugin at user scope. Claude Code caches all plugin files (agent prompts, templates, schemas), so the skill works from any project directory — you don't need to stay in this repo.

3. Restart Claude Code

The /web-scraper skill will appear in your skills list after restarting the session.

Alternative: load for a single session

If you just want to try it without installing:

claude --plugin-dir /path/to/web-scraper-plugin

Prerequisites

Claude Code (CLI, desktop app, or IDE extension)
Python 3.10+
pip (scraping dependencies are installed on the fly)

Usage

Once installed, invoke the skill from any project:

/web-scraper https://example.com/products — extract all product images and descriptions

Or just describe what you want:

Scrape all article text and figures from https://example.com/blog

The pipeline runs automatically:

1. Intent Capture: confirms what you want (IntentBrief)
2. Site Analysis: analyzes HTML structure and detects pagination and JS rendering
3. Execution: generates and runs a Python script
4. Validation: checks results and iterates up to 3 times if needed
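
The validation feedback loop can be pictured roughly as follows. This is a minimal sketch, not the plugin's actual orchestration code: `RunResult`, `run_with_feedback`, and the `execute`/`validate` callables are hypothetical names, and it assumes "up to 3 times" means three attempts in total.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_ITERATIONS = 3  # assumption: three attempts in total


@dataclass
class RunResult:
    """Hypothetical container for what one script run produced."""
    items: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def run_with_feedback(
    execute: Callable[[List[str]], RunResult],
    validate: Callable[[RunResult], List[str]],
    max_iterations: int = MAX_ITERATIONS,
) -> Tuple[RunResult, int]:
    """Run, validate, and retry with the validator's feedback until it passes."""
    feedback: List[str] = []
    result = RunResult()
    for attempt in range(1, max_iterations + 1):
        result = execute(feedback)      # regenerate/run using prior feedback
        feedback = validate(result)     # empty list means validation passed
        if not feedback:
            return result, attempt
    # Attempts exhausted: return the last (possibly partial) result.
    return result, max_iterations
```

Passing the validator's findings back into `execute` is what makes the loop self-correcting rather than a plain retry.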

Results appear in ./scrape_output/{domain}/ in your current working directory.

Output Structure

scrape_output/
  example.com/
    images/           # Downloaded images at original quality
    text/             # Extracted text as individual .txt files
    results.json      # Manifest with metadata, paths, and errors
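
As an illustration of the layout above, the paths for a given URL could be derived like this. It is a sketch under assumptions: `output_dirs` is a hypothetical helper, and it assumes `{domain}` is the URL's hostname used verbatim (the plugin may normalize it differently, for example by stripping `www.`).

```python
from pathlib import Path
from typing import Dict
from urllib.parse import urlparse


def output_dirs(url: str, base: Path = Path("scrape_output")) -> Dict[str, Path]:
    """Map a target URL to the output paths shown in the tree above."""
    domain = urlparse(url).hostname  # assumption: hostname is used as-is
    if not domain:
        raise ValueError(f"URL has no host: {url!r}")
    root = base / domain
    return {
        "root": root,
        "images": root / "images",          # downloaded images
        "text": root / "text",              # extracted .txt files
        "manifest": root / "results.json",  # metadata, paths, and errors
    }


if __name__ == "__main__":
    for name, path in output_dirs("https://example.com/products").items():
        print(f"{name}: {path}")
```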

Supported Site Types

Type	Strategy	How it works
Static HTML	`static_html`	requests + BeautifulSoup
Paginated static	`static_html_paginated`	Follows next-page links
JS-rendered	`js_rendered`	Playwright headless browser
Infinite scroll	`js_rendered_scroll`	Playwright + scroll detection
API-backed	`api_direct`	Direct JSON API calls
Mixed	`hybrid`	Static first, Playwright fallback

The system automatically detects which strategy to use.
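
To make the `static_html_paginated` row concrete, here is one way next-page links could be followed. This is only a sketch: it uses the standard library's `html.parser` instead of BeautifulSoup to stay dependency-free, it assumes the site marks its next link with `rel="next"`, and `NextLinkFinder` and `find_next_page` are hypothetical names rather than part of the plugin's templates.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the href of the first <a> or <link> tag with rel="next"."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def find_next_page(html: str, current_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # Resolve relative links such as "/blog?page=2" against the current page.
    return urljoin(current_url, finder.next_href)
```

A scraper would call `find_next_page` after each fetch and stop when it returns `None` or the page limit is reached. Sites that use a "Next" button without `rel="next"` would need a different selector.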

Configuration

The pipeline respects these settings (configurable per run):

Setting	Default	Description
Rate limit	1000ms	Delay between requests
Max pages	10	Maximum pages to scrape
Request timeout	30s	Per-request timeout
Image quality	Original	Downloads full-resolution images
Script timeout	5 min	Maximum script execution time
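
The defaults in this table could be modeled as follows. This is a hedged sketch: `ScrapeConfig`, its field names, and `throttled` are hypothetical, chosen only to show how the rate limit and page cap might interact in a generated script.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class ScrapeConfig:
    """Defaults mirror the table above; field names are illustrative."""
    rate_limit_ms: int = 1000         # delay between requests
    max_pages: int = 10               # maximum pages to scrape
    request_timeout_s: float = 30.0   # per-request timeout
    image_quality: str = "original"   # full-resolution images
    script_timeout_s: float = 300.0   # 5 minutes


def throttled(
    urls: Iterable[str],
    config: ScrapeConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield at most max_pages URLs, pausing rate_limit_ms between them."""
    for index, url in enumerate(urls):
        if index >= config.max_pages:
            break
        if index > 0:  # no delay before the first request
            sleep(config.rate_limit_ms / 1000)
        yield url
```

Injecting `sleep` keeps the helper easy to test without real delays.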

Plugin Structure

.claude-plugin/
  plugin.json              # Plugin manifest
  marketplace.json         # Marketplace metadata
skills/
  web-scraper/
    SKILL.md               # Orchestrator skill
    references/            # Agent prompts + playbook
    schemas/               # JSON Schema data contracts
    assets/templates/      # Python script templates
docs/
  prd-web-scraper-system.md
  Web Scraping Pipeline.pdf

Troubleshooting

Playwright not installed: The system installs it automatically. If the automatic install fails, install it manually:

pip install playwright && python -m playwright install chromium

Rate limiting / 429 errors: Increase the delay between requests (the "Rate limit" setting), for example: "Scrape with a 2 second delay between requests"
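
For context, a script that hits HTTP 429 responses might pick its next wait as sketched below. This is not necessarily what the generated scripts do: `backoff_delay` is a hypothetical helper that honors a numeric `Retry-After` header when present and otherwise doubles the base delay on each attempt, up to a cap.

```python
from typing import Optional


def backoff_delay(
    attempt: int,
    base_delay_s: float = 1.0,
    retry_after: Optional[str] = None,
    max_delay_s: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    # Prefer the server's hint when Retry-After is a whole number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after), max_delay_s)
    # Otherwise (no header, or an HTTP-date value) use exponential backoff.
    return min(base_delay_s * (2 ** attempt), max_delay_s)
```

Asking for a longer delay up front, as in the prompt above, is still the simpler fix. Backoff only helps once the server has already started refusing requests.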

Authentication required — Tell Claude: "The site requires login." It will ask for credentials and pass them via environment variables.
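
On the script side, reading credentials from the environment might look like the sketch below. The variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` and the `load_credentials` helper are assumptions for illustration. The names the plugin actually uses are not documented here.

```python
import os
from typing import Mapping, Tuple


def load_credentials(
    env: Mapping[str, str] = os.environ,
    user_var: str = "SCRAPER_USERNAME",      # hypothetical variable name
    password_var: str = "SCRAPER_PASSWORD",  # hypothetical variable name
) -> Tuple[str, str]:
    """Read login credentials from the environment instead of hard-coding them."""
    missing = [name for name in (user_var, password_var) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[user_var], env[password_var]
```

Keeping credentials in environment variables means they never appear in the generated script or in the run logs.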

Partial results after 3 iterations: The site may have an unusual structure. Check ./scrape_runs/ for run logs that detail what failed.
