From agent-almanac
Escalates blocked scraping campaigns via provider-neutral proxy rotation across datacenter, residential, and mobile pools. Integrates with scrapling, handles session stickiness, cost monitoring, and legal boundaries. Use after client-side stealth is exhausted.
Slash command: /agent-almanac:rotate-scraping-proxies
Network-layer escalation for scraping campaigns where client-side stealth has already been exhausted. Proxy rotation is a last resort, not a default — it is expensive, ethically charged, and easily misused. This skill teaches when not to use it as much as how to use it well.
Use when:
- headless-web-scraping (Fetcher → StealthyFetcher → DynamicFetcher) has been tried and the target still returns 403/429/geo-blocks
- robots.txt permits the path
- … python-requests)

Do not use when: a public API exists (use it), the site's ToS forbids automated access, you would be circumventing geo-licensing, or the goal is fraud / credential stuffing / sneaker bots / content piracy.
Gate the entire workflow on a documented legal and ethical review. Skipping this step is the single biggest source of harm.
# Inputs to confirm before writing any code:
# 1. Is the data public (no login required)?
# 2. Does robots.txt permit the path?
# 3. Does the site's ToS prohibit automated access? (read it)
# 4. Would the scraping process personal data? If yes, what is the legal basis?
# 5. Could this access circumvent geo-licensing, paywalls, or auth?
# 6. Is there a public API or data dump that would make scraping unnecessary?
# 7. Have you contacted the site owner if scope is large?
Expected: Every question has a defensible written answer. The first "no" or "unknown" stops the procedure until resolved.
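The checklist above can be enforced in code so a scraper refuses to start until every answer is recorded. The following is a minimal sketch, not part of scrapling: `LegalReview`, `first_blocker`, and `robots_permits` are hypothetical names introduced here for illustration, and the field names are one possible mapping of the seven questions. The robots.txt check uses the standard library's `urllib.robotparser` on content you have already downloaded.

```python
from dataclasses import dataclass, fields
from typing import Optional
from urllib.robotparser import RobotFileParser


@dataclass
class LegalReview:
    """Hypothetical record of the seven answers; None means 'unknown'."""
    data_is_public: Optional[bool] = None
    robots_permits_path: Optional[bool] = None
    tos_allows_automation: Optional[bool] = None
    personal_data_has_legal_basis: Optional[bool] = None  # True if no personal data
    no_circumvention: Optional[bool] = None
    no_api_or_dump_available: Optional[bool] = None
    owner_contacted_if_large: Optional[bool] = None  # True if scope is small


def first_blocker(review: LegalReview) -> Optional[str]:
    """Return the name of the first answer that is False or unknown, else None."""
    for f in fields(review):
        if getattr(review, f.name) is not True:
            return f.name
    return None


def robots_permits(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A run would call `first_blocker(review)` before any fetch and stop if it returns a field name. The written justification for each answer still has to live in a document a human has read; the code only guards against skipping the step.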
On failure:
Different pool types have different cost, detectability, and ethical profiles. Pick the cheapest tier that actually solves your block.
| Pool type | Detectability | Cost | Best for |
|---|---|---|---|
| Datacenter | High (easily blocked by Cloudflare/Akamai) | $ | Sites with no real anti-bot, geo-shifting only |
| Residential | Low (real ISP IPs) | $$$ | Sites that block datacenter ASNs |
| Mobile | Very low (carrier-grade NAT, shared with thousands) | $$$$ | Sites that block even residential IPs (rare) |
Ethical caveat for residential and mobile: these pools route your traffic through real consumer connections. The pool operator's consent model varies — some pay users, some bundle exit-node consent into "free VPN" EULAs that users do not read. Prefer providers with audited, opt-in consent. If you would not be comfortable with a stranger sending your scraping traffic through your home router, do not send yours through theirs.
Expected: A documented choice with the cheapest viable tier and a brief note on why higher tiers were rejected (or why a higher tier is needed).
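The "cheapest tier that actually solves your block" rule can be written down as a small helper. This is a sketch under the assumption that you have already observed which tiers the target blocks; `TIERS` and `cheapest_viable_tier` are hypothetical names, not part of any provider SDK.

```python
from typing import Iterable, Optional

# Tiers ordered from cheapest to most expensive, matching the table above.
TIERS = ("datacenter", "residential", "mobile")


def cheapest_viable_tier(blocked: Iterable[str]) -> Optional[str]:
    """Return the cheapest tier not known to be blocked, or None if all are."""
    blocked_set = set(blocked)
    for tier in TIERS:
        if tier not in blocked_set:
            return tier
    return None
```

For example, `cheapest_viable_tier(["datacenter"])` returns `"residential"`. A `None` result means every tier is blocked, which is a signal to stop rather than to look for a fourth option.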
On failure:
Wire the proxy into scrapling fetchers. Read credentials from environment
variables — never hard-code, never commit a .env to git.
```python
import os
import random

from scrapling import Fetcher, StealthyFetcher

# Pattern A: provider-managed rotating endpoint (one URL, provider rotates per request)
PROXY_URL = os.environ["SCRAPING_PROXY_URL"]  # e.g. http://user:<password>@<provider-host>:7777

fetcher = StealthyFetcher()
fetcher.configure(
    headless=True,
    timeout=60,
    network_idle=True,
    proxy=PROXY_URL,
)

# Pattern B: explicit pool, rotate yourself
POOL = os.environ["SCRAPING_PROXY_POOL"].split(",")  # comma-separated URLs


def fetch_with_rotation(url):
    """Fetch url through a proxy chosen at random from POOL."""
    proxy = random.choice(POOL)
    fetcher = StealthyFetcher()
    fetcher.configure(headless=True, timeout=60, proxy=proxy)
    return fetcher.get(url)
```
Expected: Requests succeed and the egress IP varies between calls.
Confirm by hitting an IP-echo endpoint (e.g. https://api.ipify.org) before
running the real scrape.
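The IP-echo check can be automated before the real run. The sketch below assumes you supply a callable that performs one request through the proxy and returns the echoed IP as text (for instance, the body returned by https://api.ipify.org); `distinct_egress_ips` and `rotation_is_working` are hypothetical helper names.

```python
from typing import Callable, Set


def distinct_egress_ips(fetch_ip: Callable[[], str], attempts: int = 5) -> Set[str]:
    """Call fetch_ip repeatedly and collect the distinct IPs it reports."""
    seen = set()
    for _ in range(attempts):
        seen.add(fetch_ip().strip())
    return seen


def rotation_is_working(ips: Set[str], minimum_distinct: int = 2) -> bool:
    """Treat rotation as working once at least minimum_distinct IPs appear."""
    return len(ips) >= minimum_distinct
```

Taking the fetch step as a parameter keeps the check independent of the fetcher in use and lets it be tested offline with a stub. A small pool can legitimately return the same IP twice in a row, so a handful of attempts is more informative than two.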
On failure:
- …-rotating or per-request flag

Decide rotation granularity per workload, then keep the pool healthy.
```python
import random
import time

# StealthyFetcher and fetch_with_rotation come from the previous block.

# Sticky session for stateful flows (login, multi-page checkout-like crawls)
# Most providers expose a session ID via the username:
#   user-session-abc123:<password>@<provider-host>:7777
# All requests with the same session ID exit through the same IP for ~10 min.

# Per-request rotation for anonymous bulk scraping (default)


# Pool health check: call before bulk run
def check_pool(pool, sample_size=5):
    sample = random.sample(pool, min(sample_size, len(pool)))
    alive = []
    for proxy in sample:
        try:
            # Configure and fetch in separate statements, as in Pattern A
            fetcher = StealthyFetcher()
            fetcher.configure(proxy=proxy, timeout=10)
            r = fetcher.get("https://api.ipify.org")
            if r.status == 200:
                alive.append(proxy)
        except Exception:
            pass  # treat any error as a dead proxy
    return alive


# Backoff on transient proxy failures
def fetch_with_backoff(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = fetch_with_rotation(url)
            if r.status not in (407, 502, 503):
                return r
        except Exception:
            pass  # fall through to the backoff sleep and retry
        time.sleep(2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return None
```
Expected: Stateful flows preserve cookies across requests; bulk anonymous scraping shows IP variance across requests; dead proxies are skipped instead of looping.
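For sticky sessions, the session ID usually has to be spliced into the proxy username. The sketch below shows one way to do that with the standard library. It assumes the `-session-<id>` username suffix used in the comment above; providers differ, so treat the suffix and the helper name `sticky_proxy_url` as assumptions to check against your provider's documentation.

```python
from urllib.parse import urlsplit, urlunsplit


def sticky_proxy_url(base_url: str, session_id: str) -> str:
    """Append '-session-<id>' to the username of a proxy URL.

    The '-session-' suffix is an assumed convention, not a standard.
    """
    parts = urlsplit(base_url)
    if parts.username is None:
        raise ValueError("proxy URL has no username to attach a session to")
    userinfo = f"{parts.username}-session-{session_id}"
    if parts.password is not None:
        userinfo += f":{parts.password}"
    host = parts.hostname or ""
    if parts.port is not None:
        host += f":{parts.port}"
    netloc = f"{userinfo}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Reuse the same session ID for every request in one stateful flow and generate a fresh one for the next flow, so unrelated crawls do not share an exit IP.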
On failure:
Proxy traffic is billed per GB, per request, or both, depending on the provider. Runaway scrapers generate runaway invoices. Always include limits and an abort condition.
```python
import time


class ScrapeBudget:
    def __init__(self, max_requests, max_duration_seconds, max_failures):
        self.max_requests = max_requests
        self.max_duration = max_duration_seconds
        self.max_failures = max_failures
        self.requests = 0
        self.failures = 0
        self.start = time.monotonic()

    def allow(self):
        if self.requests >= self.max_requests:
            return False, "request cap reached"
        if time.monotonic() - self.start >= self.max_duration:
            return False, "time cap reached"
        if self.failures >= self.max_failures:
            return False, "failure cap reached (circuit breaker)"
        return True, None

    def record(self, success):
        self.requests += 1
        if not success:
            self.failures += 1


budget = ScrapeBudget(max_requests=1000, max_duration_seconds=3600, max_failures=20)

# target_urls is the list of URLs to scrape, defined elsewhere
for url in target_urls:
    ok, reason = budget.allow()
    if not ok:
        print(f"Aborting: {reason}")
        break
    response = fetch_with_backoff(url)
    budget.record(success=response is not None)
    time.sleep(1)  # rate limiting still applies even with rotation
```
Expected: Budget caps trigger before runaway cost. Logs show per-proxy success rate so a bad egress IP can be identified and excluded.
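The per-proxy success rate mentioned above needs its own bookkeeping, since a budget of this kind only counts totals. A minimal sketch of such a tracker follows; `ProxyStats` and its thresholds are hypothetical and would need tuning for a real pool.

```python
from collections import defaultdict
from typing import Dict, List


class ProxyStats:
    """Track successes and failures per proxy so bad egress IPs can be excluded."""

    def __init__(self) -> None:
        self.successes: Dict[str, int] = defaultdict(int)
        self.failures: Dict[str, int] = defaultdict(int)

    def record(self, proxy: str, success: bool) -> None:
        if success:
            self.successes[proxy] += 1
        else:
            self.failures[proxy] += 1

    def success_rate(self, proxy: str) -> float:
        total = self.successes[proxy] + self.failures[proxy]
        return self.successes[proxy] / total if total else 0.0

    def bad_proxies(self, min_requests: int = 5, min_rate: float = 0.5) -> List[str]:
        """Proxies with at least min_requests samples and a rate below min_rate."""
        seen = set(self.successes) | set(self.failures)
        return sorted(
            p for p in seen
            if self.successes[p] + self.failures[p] >= min_requests
            and self.success_rate(p) < min_rate
        )
```

Key the stats by a non-secret label (for example the pool index) rather than the full proxy URL, because the URL carries credentials and these values tend to end up in logs.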
On failure:
- … gateway., proxy=, the provider hostname)
- .env (or equivalent) is in .gitignore
- robots.txt is still respected; rotation does not override it

- … StealthyFetcher and rate limiting first; rotation is expensive and unethical to deploy unnecessarily.
- … robots.txt because "we have rotation now": rotation does not grant permission. The directive is the directive.