Skill

scrapy-consistency

Write, review, refactor, or debug Python code that uses Scrapy (spiders, Items, ItemLoaders, pipelines, middlewares, CrawlSpider rules, feed exports) using one canonical, modern idiom set. Use this skill whenever code defines a `scrapy.Spider`, parses pages in a `parse` callback, follows pagination or detail links, passes data between callbacks, configures settings like ROBOTSTXT_OBEY / DOWNLOAD_DELAY / CONCURRENT_REQUESTS, exports scraped data, or migrates off legacy APIs (`.extract_first()`, `.extract()`, `response.meta` for callback data, manual `urljoin` + `Request`). Trigger it even when the user only says "scrape this site," "crawl these pages," "my spider returns nothing," "why are my items empty," or shows a traceback mentioning twisted or scrapy, without ever mentioning Scrapy idioms.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/scrapy-consistency:scrapy-consistency

SKILL.md

124 lines · ~2.1k tokens

Stats

Stars: 0

Maintenance: Good

Last Commit: Jun 12, 2026

Scrapy — consistent, modern idioms

Scrapy is stable and well known to models, yet generated spiders drift between eras: early 1.x style (`urljoin` + manual `Request`, `.extract()` everywhere, `meta` dicts for callback data) next to modern 2.x style (`response.follow`, `.get()`/`.getall()`, `cb_kwargs`, `async def parse`). Worse, models often graft non-Scrapy habits onto spiders (collecting items into lists, calling the blocking `requests` library inside callbacks), which silently defeats the async engine. This skill pins one canonical idiom set: Scrapy 2.x semantics with a clean Spider / Item / Pipeline separation.

Canonical idioms — always X, never Y

| Always | Never | Why |
| --- | --- | --- |
| `yield response.follow(href, callback=self.parse_item)` | `urljoin(response.url, href)` + `scrapy.Request(...)` | `follow` resolves relative URLs against the response, accepts selectors and `<a>` elements directly, and is the 2.x house style. |
| `yield from response.follow_all(css="a.next", callback=...)` | a for-loop of manual Requests | One call that resolves relative URLs and builds one Request per matched link. |
| `.get()` / `.getall()` on selectors | `.extract_first()` / `.extract()` | The old names still work as aliases, but the Scrapy docs have standardized on `get`/`getall` (added in parsel 1.2, around the time of Scrapy 1.4); mixing both styles in one spider is the classic era-mix. |
| `response.css("h1::text").get(default="")` | `.extract_first()` then `or ""` | `default=` is explicit and avoids None-propagation. |
| `yield item` / `yield request` from callbacks | appending to `self.items = []` and returning at close | The engine streams items to pipelines and exports as they are yielded; accumulating them in a list defeats backpressure, grows memory without bound, and bypasses feed exports. |
| `cb_kwargs={"category": cat}` → `def parse_item(self, response, category)` | stuffing scraped data into `request.meta` | `meta` is shared with middlewares (proxy, retry, depth keys); `cb_kwargs` is the dedicated channel since 1.7, and a mismatch with the callback signature fails loudly. |
| more `scrapy.Request`s (or `response.follow`) for extra pages | `import requests` / `httpx` calls inside a spider | Blocking I/O in a callback freezes the whole Twisted reactor: every concurrent request stalls. |
| `Item` (or a dataclass/attrs item) + `ItemLoader` for messy extraction | ad-hoc nested dicts with inline `.strip()` chains | Items declare the schema; loaders centralize `MapCompose(str.strip)`-style cleanup; pipelines can rely on field names. |
| feed exports: `scrapy crawl spider -O out.json` or `FEEDS` setting | `open("out.json", "w")` inside the spider | Hand-written writers race with concurrency, skip serialization, and ignore `FEED_EXPORT_ENCODING`. |
| validation/dedup/persistence in `ItemPipeline.process_item` | doing it inline in `parse` | Separation keeps callbacks pure extraction; `DropItem` gives you stats for free. |
| per-spider tweaks in `custom_settings` (class attribute) | mutating `settings` at runtime or editing project settings for one spider | Settings are frozen once the crawler starts; `custom_settings` is the supported precedence layer. |
| `async def parse(self, response):` when you need `await` | wrapping coroutines in Deferred glue by hand | Native coroutine callbacks are supported in 2.x; the engine awaits them. |

House style for a spider — extraction only, everything else delegated:

```python
import scrapy

# Project-local modules: the Item schema and its ItemLoader live outside the spider.
from myproject.items import ProductItem
from myproject.loaders import ProductLoader


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/catalog"]
    # Per-spider politeness overrides; project-wide defaults stay in settings.py.
    custom_settings = {"DOWNLOAD_DELAY": 0.5, "AUTOTHROTTLE_ENABLED": True}

    def parse(self, response):
        # One request per product link; the listing URL travels via cb_kwargs.
        yield from response.follow_all(
            css="a.product-link",
            callback=self.parse_product,
            cb_kwargs={"listing_url": response.url},
        )
        # Pagination: follow the "next" link back into this same callback.
        yield from response.follow_all(css="a.next-page", callback=self.parse)

    def parse_product(self, response, listing_url):
        # Extraction only: cleanup belongs to the loader, validation to a pipeline.
        loader = ProductLoader(item=ProductItem(), response=response)
        loader.add_css("name", "h1.title::text")
        loader.add_css("price", "span.price::text")
        loader.add_value("url", response.url)
        loader.add_value("found_via", listing_url)
        yield loader.load_item()
```

Pitfalls that produce silently wrong results

- Forgetting to yield the request: `response.follow(...)` only builds and returns a `Request`; calling it without `yield` schedules nothing, and the spider "finishes" with partial data and no error.
- Filtered duplicates eating pages: the dupefilter drops already-seen URLs silently (logged at DEBUG). For deliberate re-fetches pass `dont_filter=True`; for "why only one page?" check the `dupefilter/filtered` stat first.
- `ROBOTSTXT_OBEY = True` (the project-template default) silently skips disallowed URLs: zero items, no exception. Check the log for "Forbidden by robots.txt" before debugging selectors.
- Mutable values shared through `meta`/`cb_kwargs`: `Request` shallow-copies the dict you pass, but nested mutable values inside it (a partially built item, a list) are still the same object for every request that received them, so later callbacks see each other's mutations. Build fresh values per request.
- Selector matches nothing, item ships anyway: `.get()` quietly puts `None` into the item. Use `default=`, loaders with required-field pipelines, or validate in `process_item`.
- Blocking calls (`requests.get`, `time.sleep`, heavy DB writes) inside callbacks don't error; they just serialize the whole crawl. Throughput collapse is the only symptom.
- `start_urls` with a custom `start_requests`: defining `start_requests` makes `start_urls` dead code unless the override iterates it itself; keep one or the other, not both half-wired.
- `DOWNLOAD_DELAY` vs autothrottle confusion: with `AUTOTHROTTLE_ENABLED = True`, `DOWNLOAD_DELAY` is the floor, not the rate. Per-domain concurrency (`CONCURRENT_REQUESTS_PER_DOMAIN`) still applies on top of `CONCURRENT_REQUESTS`.
- Running `CrawlerProcess` twice in one Python process fails: the Twisted reactor is not restartable. Scripts should start the process once; otherwise use `scrapy crawl`.

Version notes

Target Scrapy 2.x. The key line is the 1.x → 2.x transition: 2.0 brought native `async def` callbacks and added `response.follow_all`. `response.follow` (1.4+), `.get()`/`.getall()` (from parsel 1.2, the same era as Scrapy 1.4) and `cb_kwargs` (1.7+) are the established modern spellings; `.extract_first()`/`.extract()` remain as supported aliases that the Scrapy docs no longer use. All canonical idioms here run on any maintained 2.x release, so there is no reason to write era-mixed spiders to be "safe."

Workflow

1. Sketch the data contract first: define the Item fields (and loader processors) before writing callbacks.
2. Write callbacks as pure generators: select with `response.css("selector::text")` / `response.xpath(...)`, then yield items and `response.follow(...)` requests. No lists, no blocking I/O, no file writes.
3. Pass inter-callback data via `cb_kwargs`; reserve `meta` for middleware-facing keys (`proxy`, `download_timeout`, playwright flags).
4. Put cleanup in loaders, validation/persistence in pipelines, and output in feed exports (`-O file.json` or the `FEEDS` setting).
5. Set politeness explicitly (`ROBOTSTXT_OBEY`, `AUTOTHROTTLE_ENABLED`, `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS[_PER_DOMAIN]`): project-wide in settings.py, per-spider in `custom_settings`.
6. When reviewing existing code, flag any "Never" column pattern above and rewrite it in the canonical form rather than patching around it.

For the fuller migration map (old API → modern API), expanded gotcha explanations, and more worked examples, read references/scrapy-patterns.md.
