From agent-almanac
Extracts data from JS-rendered, Cloudflare-protected, or dynamic SPA pages using the scrapling Python library with three-tier fetcher selection (HTTP, stealth Chromium, full browser automation) and CSS selectors. Use when WebFetch or simple HTTP requests fail due to anti-bot defenses or DOM-traversal needs.
Slash command: /agent-almanac:headless-web-scraping
Extract data from web pages that resist simple HTTP requests — JS-rendered content, Cloudflare-protected sites, and dynamic SPAs — using scrapling's three-tier fetcher architecture and CSS-based data extraction.
Use when WebFetch or requests.get() returns empty or blocked responses.
Determine which scrapling fetcher matches the target site's defenses.
# Decision matrix:
# 1. Fetcher — static HTML, no JS, no anti-bot (fastest)
# 2. StealthyFetcher — Cloudflare/Turnstile, TLS fingerprint checks
# 3. DynamicFetcher — JS-rendered SPAs, click/scroll interactions
# Quick probe: try Fetcher first, escalate on failure
from scrapling import Fetcher
fetcher = Fetcher()
response = fetcher.get("https://example.com/target-page")
if response.status == 200 and response.get_all_text():
    print("Fetcher tier sufficient")
else:
    print("Escalate to StealthyFetcher or DynamicFetcher")
| Signal | Recommended Tier |
|---|---|
| Static HTML, no protection | Fetcher |
| 403/503, Cloudflare challenge page | StealthyFetcher |
| Page loads but content area is empty | DynamicFetcher |
| Need to click buttons or scroll | DynamicFetcher |
| altcha CAPTCHA present | None (cannot be automated) |
Expected: One of the three tiers is identified. Many modern sites end up needing StealthyFetcher, but the Fetcher probe is still the cheapest first check.
On failure: If all three tiers return blocked responses, check whether the site uses altcha CAPTCHA (proof-of-work challenge that cannot be bypassed). If so, document the limitation and provide manual extraction instructions instead.
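
The signal table above can also be written as a small pure function, which makes the escalation rules easy to unit test. This is a minimal sketch rather than part of scrapling: `recommend_tier` and its parameters are hypothetical names, and the precedence between signals (for example, interaction needs winning over a 403) is an assumption.

```python
from typing import Optional


def recommend_tier(
    status: int,
    has_content: bool,
    needs_interaction: bool = False,
    altcha_present: bool = False,
) -> Optional[str]:
    """Map probe signals to a fetcher tier name (hypothetical helper).

    Returns None when the page cannot be automated at all.
    """
    if altcha_present:
        return None  # altcha proof-of-work: no tier helps
    if needs_interaction:
        return "DynamicFetcher"  # clicks/scrolling need a full browser
    if status in (403, 503):
        return "StealthyFetcher"  # typical anti-bot challenge responses
    if not has_content:
        return "DynamicFetcher"  # page loaded, content area empty
    return "Fetcher"  # static HTML, no protection


print(recommend_tier(200, True))
print(recommend_tier(403, False))
print(recommend_tier(200, False))
```

The inputs would come from the quick probe: the response status, whether any text came back, and what you already know about the page.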
Set up the selected fetcher with appropriate options.
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher
# Tier 1: Fast HTTP with TLS fingerprint impersonation
fetcher = Fetcher()
fetcher.configure(
    timeout=30,
    retries=3,
    follow_redirects=True
)
# Tier 2: Headless Chromium with anti-detection
fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True  # wait for all network requests to settle
)
# Tier 3: Full browser automation
fetcher = DynamicFetcher()
fetcher.configure(
    headless=True,
    timeout=90,
    network_idle=True,
    wait_selector="div.results"  # wait for specific element before extracting
)
Expected: Fetcher instance is configured and ready. No errors on instantiation. For StealthyFetcher and DynamicFetcher, a Chromium binary is available (scrapling manages this automatically on first run).
On failure:
- playwright or browser binary not found -- run python -m playwright install chromium
- configure() timeout exceeded -- increase the timeout value or check network connectivity
- scrapling is not installed -- run pip install scrapling

Navigate to the target URL and extract structured data using CSS selectors.
# Fetch the page
response = fetcher.get("https://example.com/target-page")
# Single element extraction
title = response.find("h1.page-title")
if title:
    print(title.get_all_text())
# Multiple elements
items = response.find_all("div.result-item")
for item in items:
    name = item.find("span.name")
    price = item.find("span.price")
    # Skip items missing either field instead of raising AttributeError
    if name and price:
        print(f"{name.get_all_text()}: {price.get_all_text()}")
# Get attribute values
links = response.find_all("a.product-link")
urls = [link.get("href") for link in links]
# Get raw HTML content of an element (find() may return None)
description = response.find("div.description")
detail_html = description.html_content if description else None
Key API reference:
| Method | Purpose |
|---|---|
| response.find("selector") | First matching element |
| response.find_all("selector") | All matching elements |
| element.get("attr") | Attribute value (href, src, data-*) |
| element.get_all_text() | All text content, recursively |
| element.html_content | Raw inner HTML |
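
Attribute values such as href are often relative paths in the page markup. Assuming element.get("href") returns the raw attribute string (or None when the attribute is missing), the standard-library `urljoin` can resolve each value against the page URL. The helper name `absolutize` and the sample values are illustrative only.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve href values against the page URL, skipping missing ones."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Illustrative values standing in for [link.get("href") for link in links]
page_url = "https://example.com/catalog/page-2"
raw_hrefs = ["/item/1", "item/2", "https://cdn.example.com/x", None]

resolved = absolutize(page_url, raw_hrefs)
for url in resolved:
    print(url)
```

Root-relative paths resolve against the site root, bare relative paths resolve against the page's directory, and absolute URLs pass through unchanged.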
Expected: Extracted data matches the visible page content. Elements are non-None and text content is non-empty for populated pages.
On failure:
- find() returns None -- inspect the actual HTML (response.html_content) to verify the selector; the page may use different class names than expected
- get_all_text() returns empty text -- content may be inside shadow DOM or an iframe; try DynamicFetcher with a wait_selector
- .css_first() -- this is not part of the scrapling API (common confusion with other libraries)

Implement fallback logic for CAPTCHA detection, empty responses, and session requirements.
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

def scrape_with_fallback(url, selector):
    """Try each fetcher tier in order, with CAPTCHA detection."""
    tiers = [
        ("Fetcher", Fetcher),
        ("StealthyFetcher", StealthyFetcher),
        ("DynamicFetcher", DynamicFetcher),
    ]
    for tier_name, tier_class in tiers:
        fetcher = tier_class()
        fetcher.configure(headless=True, timeout=60)
        try:
            response = fetcher.get(url)
        except Exception as error:
            print(f"{tier_name} failed: {error}")
            continue
        # Detect CAPTCHA / challenge pages
        page_text = response.get_all_text().lower()
        if "altcha" in page_text or "proof of work" in page_text:
            print("altcha CAPTCHA detected -- cannot automate")
            return None
        if response.status == 403 or response.status == 503:
            print(f"{tier_name} blocked (HTTP {response.status}), escalating")
            continue
        result = response.find(selector)
        if result and result.get_all_text().strip():
            return result.get_all_text()
        print(f"{tier_name} returned empty content, escalating")
    print("All tiers exhausted. Manual extraction required.")
    return None
Expected: Function returns extracted text on success, or None with a diagnostic message when all tiers fail. CAPTCHA pages are detected and reported rather than retried indefinitely.
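
The escalation rules can be checked without a network or a browser by injecting the fetch step. The following is a minimal sketch under that assumption: `escalate` and the stand-in fetch functions are hypothetical, each tier is a plain callable returning (status, text), and real code would wrap a scrapling fetcher inside each callable.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical shape: (tier name, callable taking a URL and returning (status, text))
Tier = Tuple[str, Callable[[str], Tuple[int, str]]]


def escalate(url: str, tiers: Iterable[Tier]) -> Optional[Tuple[str, str]]:
    """Return (tier_name, text) from the first tier that yields usable text."""
    for name, fetch in tiers:
        try:
            status, text = fetch(url)
        except Exception:
            continue  # this tier failed outright; try the next one
        if "altcha" in text.lower():
            return None  # challenge cannot be automated; stop escalating
        if status in (403, 503) or not text.strip():
            continue  # blocked or empty; escalate
        return name, text
    return None  # all tiers exhausted


# Stand-in fetchers that simulate the three outcomes
def blocked(url):
    return 403, "Forbidden"


def empty(url):
    return 200, "   "


def works(url):
    return 200, "Widget: $9.99"


result = escalate(
    "https://example.com/target-page",
    [("tier1", blocked), ("tier2", empty), ("tier3", works)],
)
print(result)
```

Keeping the decision logic separate from the fetchers means the slow browser tiers are needed only in integration runs.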
On failure:
Implement delays and respect site policies before running at scale.
import time
import urllib.robotparser
from scrapling import StealthyFetcher

def check_robots_txt(base_url, target_path):
    """Check if scraping is allowed by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch("*", f"{base_url}{target_path}")

def scrape_urls(urls, selector, delay=1.0):
    """Scrape multiple URLs with rate limiting."""
    results = []
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60)
    for url in urls:
        response = fetcher.get(url)
        data = response.find(selector)
        if data:
            results.append(data.get_all_text())
        time.sleep(delay)  # respect the server
    return results
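
The robots.txt rules can be exercised offline by feeding lines to the standard-library parser directly, which is handy for testing before a bulk run. This sketch assumes a made-up robots.txt body. `load_rules` is a hypothetical helper, and using the Crawl-delay value as the scraping delay is a suggestion rather than something the steps above require.

```python
import urllib.robotparser


def load_rules(robots_lines):
    """Build a parser from robots.txt lines without any network access."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp


# Made-up robots.txt content for illustration
sample = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rules = load_rules(sample)
allowed = rules.can_fetch("*", "https://example.com/products/1")
denied = rules.can_fetch("*", "https://example.com/private/report")
# Fall back to a 1-second delay when the site declares no Crawl-delay
delay = rules.crawl_delay("*") or 1.0

print(allowed, denied, delay)
```

The resulting delay could be passed as the delay argument of scrape_urls so that the request rate follows the site's stated preference.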
Ethical scraping checklist:
- Check robots.txt before scraping -- respect Disallow directives

Expected: Scraping runs at a controlled rate. robots.txt is checked before bulk operations. No 429 responses are triggered.
On failure:
- robots.txt disallows the path -- respect the directive; do not override it

Validation:
- configure() method is used (not deprecated constructor kwargs)
- .find() / .find_all() API is used (not .css_first() or other library methods)
- robots.txt is checked before bulk operations

Common pitfalls:
- .css_first() instead of .find(): scrapling uses .find() and .find_all() for element selection -- .css_first() belongs to a different library and will raise AttributeError
- Fetcher first, then escalate -- DynamicFetcher is 10-50x slower due to full browser startup
- configure(): scrapling v0.4.x deprecated passing options to the constructor; always use the configure() method

Install with: npx claudepluginhub pjt222/agent-almanac