From scrapy-consistency
Write, review, refactor, or debug Python code that uses Scrapy (spiders, Items, ItemLoaders, pipelines, middlewares, CrawlSpider rules, feed exports) using one canonical, modern idiom set. Use this skill whenever code defines a `scrapy.Spider`, parses pages in a `parse` callback, follows pagination or detail links, passes data between callbacks, configures settings like ROBOTSTXT_OBEY / DOWNLOAD_DELAY / CONCURRENT_REQUESTS, exports scraped data, or migrates off legacy APIs (`.extract_first()`, `.extract()`, `response.meta` for callback data, manual `urljoin` + `Request`). Trigger it even when the user just says "scrape this site," "crawl these pages," "my spider returns nothing," "why are my items empty," or shows a traceback mentioning twisted or scrapy, even if the words "Scrapy idioms" never come up.
Slash command: `/scrapy-consistency:scrapy-consistency`
Scrapy is stable and well known to models, yet generated spiders drift between eras:
pre-1.x style (urljoin + manual Request, .extract() everywhere, meta dicts for
callback data) next to modern 2.x style (response.follow, .get()/.getall(),
cb_kwargs, async def parse). Worse, models often graft non-Scrapy habits onto spiders —
collecting items into lists, calling the blocking requests library inside callbacks —
which silently defeats the async engine. This skill pins one canonical idiom set:
Scrapy 2.x semantics with a clean Spider / Item / Pipeline separation.
| Always | Never | Why |
|---|---|---|
| `yield response.follow(href, callback=self.parse_item)` | `urljoin(response.url, href)` + `scrapy.Request(...)` | `follow` resolves relative URLs, accepts selectors and `<a>` elements directly, and is the 2.x house style. |
| `yield from response.follow_all(css="a.next", callback=...)` | a for-loop of manual `Request`s | One line; handles relative URLs and dedup-friendly `Request` creation. |
| `.get()` / `.getall()` on selectors | `.extract_first()` / `.extract()` | The old names survive only as legacy aliases; the Scrapy docs use `.get()`/`.getall()`, and mixing both styles in one spider is the classic era-mix. |
| `response.css("h1::text").get(default="")` | `.extract_first()` then `or ""` | `default=` is explicit and avoids `None`-propagation. |
| `yield item` / `yield request` from callbacks | appending to `self.items = []` and returning at close | The engine streams items to pipelines/exports; accumulating in lists defeats backpressure, bloats memory, and bypasses feed exports. |
| `cb_kwargs={"category": cat}` → `def parse_item(self, response, category)` | stuffing scraped data into `request.meta` | `meta` is shared with middlewares (proxy, retry, depth keys); `cb_kwargs` is the dedicated, signature-checked channel since 1.7. |
| more `scrapy.Request`s (or `response.follow`) for extra pages | `import requests` / `httpx` calls inside a spider | Blocking I/O in a callback freezes the whole Twisted reactor, so every concurrent request stalls. |
| `Item` (or a dataclass/attrs item) + `ItemLoader` for messy extraction | ad-hoc nested dicts with inline `.strip()` chains | Items declare the schema; loaders centralize `MapCompose(str.strip)`-style cleanup; pipelines can rely on field names. |
| feed exports: `scrapy crawl spider -O out.json` or the `FEEDS` setting | `open("out.json", "w")` inside the spider | Hand-written writers race with concurrency, skip serialization, and ignore `FEED_EXPORT_ENCODING`. |
| validation/dedup/persistence in `ItemPipeline.process_item` | doing it inline in `parse` | Separation keeps callbacks pure extraction; `DropItem` gives you stats for free. |
| per-spider tweaks in `custom_settings` (class attribute) | mutating settings at runtime or editing project settings for one spider | Settings are frozen once the crawler starts; `custom_settings` is the supported precedence layer. |
| `async def parse(self, response):` when you need `await` | wrapping coroutines in `Deferred` glue by hand | Native coroutine callbacks are supported in 2.x; the engine awaits them. |
House style for a spider (extraction only, everything else delegated):

    import scrapy

    from myproject.items import ProductItem
    from myproject.loaders import ProductLoader


    class ProductsSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://example.com/catalog"]
        custom_settings = {"DOWNLOAD_DELAY": 0.5, "AUTOTHROTTLE_ENABLED": True}

        def parse(self, response):
            # Follow every product link; listing_url travels via cb_kwargs.
            yield from response.follow_all(
                css="a.product-link", callback=self.parse_product,
                cb_kwargs={"listing_url": response.url},
            )
            # Pagination re-enters this same callback.
            yield from response.follow_all(css="a.next-page", callback=self.parse)

        def parse_product(self, response, listing_url):
            # The loader handles cleanup; the callback only names selectors.
            loader = ProductLoader(item=ProductItem(), response=response)
            loader.add_css("name", "h1.title::text")
            loader.add_css("price", "span.price::text")
            loader.add_value("url", response.url)
            loader.add_value("found_via", listing_url)
            yield loader.load_item()
Gotchas:

- Forgetting to `yield` the request: `response.follow(...)` returns a `Request`; calling it without `yield` does nothing, and the spider "finishes" with partial data and no error.
- Duplicate URLs are filtered silently: a request to an already-seen URL is dropped unless it sets `dont_filter=True`; for "why only one page?" check the `dupefilter/filtered` stat first.
- `ROBOTSTXT_OBEY = True` (the project-template default) silently skips disallowed URLs: zero items, no exception. Check the log for "Forbidden by robots.txt" before debugging selectors.
- Mutable values in `meta`/`cb_kwargs` shared across requests: passing the same mutable object (for example one half-built item dict) to many requests means later callbacks see each other's mutations, because `Request` copies only the outer dict, not the values inside it. Build a fresh object per request.
- Selectors that match nothing: `.get()` quietly yields `None` into the item. Use `default=`, loaders with required-field pipelines, or validate in `process_item`.
- Blocking calls (`requests.get`, `time.sleep`, heavy DB writes) inside callbacks don't error; they just serialize the whole crawl. Throughput collapse is the only symptom.
- Mixing `start_urls` with a custom `start_requests`: defining `start_requests` makes `start_urls` dead code; keep one or the other, not both half-wired.
- `DOWNLOAD_DELAY` vs autothrottle confusion: with `AUTOTHROTTLE_ENABLED = True`, `DOWNLOAD_DELAY` is the floor, not the rate. Per-domain concurrency (`CONCURRENT_REQUESTS_PER_DOMAIN`) still applies on top of `CONCURRENT_REQUESTS`.
- Running `CrawlerProcess` twice in one Python process fails: the Twisted reactor is not restartable. Scripts should run once per process; otherwise use `scrapy crawl`.

Target Scrapy 2.x. The key line is the 1.x → 2.x transition: 2.0 brought native `async def` callbacks and made `response.follow_all` available; `.get()`/`.getall()`, `response.follow` (1.4+), and `cb_kwargs` (1.7+) are the established modern spellings, with `.extract_first()`/`.extract()` kept only as legacy aliases. All canonical idioms here run on any maintained 2.x release, so never write era-mixed spiders to be "safe."
- Define the `Item` fields (and loader processors) before writing callbacks.
- Extract with `response.css(...)::text` / `response.xpath(...)`, then yield items and `response.follow(...)` requests. No lists, no blocking I/O, no file writes.
- Pass data between callbacks with `cb_kwargs`; reserve `meta` for middleware-facing keys (`proxy`, `download_timeout`, playwright flags).
- Export with feed exports (`-O file.json` or the `FEEDS` setting).
- Configure `ROBOTSTXT_OBEY`, `AUTOTHROTTLE_ENABLED`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS[_PER_DOMAIN]`: project-wide in `settings.py`, per-spider in `custom_settings`.

For the fuller migration map (old API → modern API), expanded gotcha explanations, and more worked examples, read `references/scrapy-patterns.md`.
npx claudepluginhub guidogl/scrapy-consistency --plugin scrapy-consistency