Builds, updates, and troubleshoots shub-workflow crawl managers that schedule Scrapy Cloud spider jobs and react to outcomes. Covers base class selection, generator pattern, hooks, and concurrent scheduling.
How this skill is triggered — by the user, by Claude, or both
Slash command
/shub-workflow-toolkit:shub-workflow-crawl-managersThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A **crawl manager** is a workflow-manager script that schedules spider jobs on Scrapy Cloud and
A crawl manager is a workflow-manager script that schedules spider jobs on Scrapy Cloud and
reacts to how they finish. The classes live in
shub_workflow/crawl.py
on top of WorkFlowManager → BaseLoopScript. Full reference:
Appendix D: Crawl Manager Classes
(and Appendix C: Workflow Manager
for the inherited layer);
gentler intro: the Crawl Managers
tutorial. This skill builds on the shub-workflow-scripts skill (the base-class layer).
| Base class | Use when | Template |
|---|---|---|
GeneratorCrawlManager | schedule a stream of jobs from a generator of params (the usual case) | examples/generator_crawl_manager.py |
AsyncSchedulerCrawlManagerMixin (+ GeneratorCrawlManager) | the cycle schedules many jobs and you want them dispatched concurrently | examples/async_generator_crawl_manager.py |
PeriodicCrawlManager | keep one spider job running, rescheduling it after each finish | examples/periodic_crawl_manager.py |
CrawlManager | schedule a single spider job once (rare on its own; it's the base of the others) | — |
Retries / outcome hooks / throttling: examples/retries_and_hooks.py. API tables: references/api-cheatsheet.md.
GeneratorCrawlManager.name and loop_mode (both effectively required — see footguns), plus spider and
default_max_jobs, and a description.set_parameters_gen() — yield one dict per job. Flat keys become spider args; the keys
spider, units, tags, job_settings, project_id are handled specially.MAX_RETRIES + get_retry_override (or a custom bad_outcome_hook),
finished_ok_hook, can_schedule_job_with_params for throttling.__main__ boilerplate — CrawlManager().run(), or asyncio.run(CrawlManager().run())
for the async mixin.setup.py; deploy via the scrapy-cloud-deployment skill.If the project already has a base crawl-manager mixin (shared options/helpers), subclass it:
class CrawlManager(ProjectMixin, GeneratorCrawlManager) — mixin first (find existing ones with
grep -rl GeneratorCrawlManager <project>/).
WorkFlowManager mandates name (and flow_id_required = True). Set the
name class attribute, or pass it as the first positional CLI arg. No name → the script errors
at startup. Distinct managers in one workflow must have distinct names.loop_mode (class attr or --loop-mode) for periodic/generator managers. With
loop_mode = 0 the manager runs one cycle and exits — a generator manager would schedule only the
first batch and quit.spider attribute vs positional. If spider is unset, a positional spider CLI arg is
required; if set, that arg disappears. A yielded "spider" key overrides the target per job (set
spider = "" to force every yielded dict to name its spider).(spider, spider_args) was already scheduled is
skipped (bloom filter). Don't expect re-yielding identical params to reschedule; vary the args,
or override create_dupe_filter() to change capacity/error rate.MAX_RETRIES (0 = off; cancelled is never retried). Tune per retry with
get_retry_override (raise StopRetry to abort), or override bad_outcome_hook entirely.
add_job() is GeneratorCrawlManager-only — don't call it on CrawlManager/PeriodicCrawlManager.AsyncSchedulerCrawlManagerMixin: mixin first in the
bases, and launch with asyncio.run(script.run()). set_parameters_gen() stays a synchronous
generator — only scheduling is async.super().add_argparser_options() before adding args, or you lose name/spider/
--spider-args/etc.default_max_jobs defaults to 1000 (not infinite); override the attr or pass --max-running-jobs.if __name__ == "__main__":
import logging
from shub_workflow.utils import get_kumo_loglevel
logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
CrawlManager().run() # ... or asyncio.run(CrawlManager().run()) for the async mixin
npx claudepluginhub scrapinghub/shub-workflow --plugin shub-workflow-toolkitWrites, fixes, or updates Python scripts using shub_workflow base classes for Scrapy Cloud operations: scheduling spiders, querying jobs, aggregating stats, or running as crawl managers/monitors.
Implements durable multi-step workflows on Cloudflare Workers with retries, state persistence, sleeps, event waiting, and NonRetryableError handling. Use for long-running tasks.
Provides TypeScript patterns for Apify SDK Actors with Crawlee crawlers, proxy/data management, lifecycle handling, and typed apify-client wrappers.