By scrapinghub
Build, schedule, and monitor Scrapy Cloud spider jobs using the shub-workflow library: write Python scripts with base classes for crawl management, job querying, and stats aggregation, and use the scanjobs tool to extract and plot data from job stats, logs, and items.
Use for help using shub_workflow's scanjobs tool — scanning ScrapyCloud jobs to extract and plot data from stats, logs, items or spider args. Covers building a scanjobs command line (the stat/log/item/spider-arg patterns, the postscript post-processor -c, the --plot mini-language, time windows, and output modes) and the predefined "programs" shortcut: the PROGRAMS dict in a project's scripts/scanjobs.py (a subclass of shub_workflow.utils.scanjobs.ScanJobs), invoked as `scanjobs.py -g <program> -v key:val`, including {var} placeholders and {{ }} escaping. Applies to any project whose scripts/scanjobs.py subclasses shub_workflow's ScanJobs.
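The {var} placeholders and {{ }} escaping mentioned above resemble Python's str.format conventions. The sketch below shows that behaviour with plain str.format. It assumes, without confirmation from the source, that scanjobs expands program templates this way. The program name, the template text, and the helper functions are hypothetical and are not part of shub-workflow.

```python
# Hypothetical illustration only: this does not import or call shub_workflow.
# Assumption: program templates expand the way Python's str.format does.

# A made-up PROGRAMS-style mapping: program name -> template string.
PROGRAMS = {
    "example": "spider={spider} literal={{braces}}",
}


def parse_vars(pairs):
    """Turn ["key:val", ...] (the shape of -v key:val) into a dict."""
    result = {}
    for pair in pairs:
        # Split on the first colon only, so values may contain colons.
        key, _, val = pair.partition(":")
        result[key] = val
    return result


def expand(program, pairs):
    """Fill the {var} placeholders; {{ }} collapses to literal braces."""
    return PROGRAMS[program].format(**parse_vars(pairs))


expanded = expand("example", ["spider:myspider"])
print(expanded)
```

Under this assumption a single-brace name is replaced by the matching key:val value, while doubled braces survive as literal single braces in the expanded text.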
Use when building, updating, fixing, or understanding a shub-workflow crawl manager — a script that schedules spider jobs on Scrapy Cloud and reacts to their outcomes, built on shub_workflow.crawl (CrawlManager / PeriodicCrawlManager / GeneratorCrawlManager / AsyncSchedulerCrawlManagerMixin) and WorkFlowManager. Covers choosing the base class, the set_parameters_gen() generator pattern, the outcome/retry/throttling hooks, async concurrent scheduling, and the name/flow-id/loop-mode rules. Use for any crawl-manager script in any project that subclasses these classes (directly or via a project base mixin).
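One way to picture the set_parameters_gen() generator pattern is as a generator that yields one dict of spider arguments per job, consumed by a manager loop that only takes as many as it can run at once. The sketch below is a standalone toy under that assumption: it does not import shub_workflow, and schedule_in_passes, max_running, and the category values are hypothetical names chosen for illustration.

```python
# Standalone toy: NOT the shub_workflow implementation.
from itertools import islice
from typing import Dict, Iterator, List


def set_parameters_gen() -> Iterator[Dict[str, str]]:
    """Yield one dict of spider arguments per job (hypothetical values)."""
    for category in ("books", "music", "games"):
        yield {"category": category}


def schedule_in_passes(
    params: Iterator[Dict[str, str]], max_running: int
) -> List[List[Dict[str, str]]]:
    """Hypothetical manager loop: pull at most max_running jobs per pass."""
    passes = []
    while True:
        batch = list(islice(params, max_running))
        if not batch:
            break  # generator exhausted: nothing left to schedule
        passes.append(batch)
    return passes


passes = schedule_in_passes(set_parameters_gen(), max_running=2)
for number, batch in enumerate(passes, 1):
    print(f"pass {number}: {batch}")
```

Because the parameters come from a generator, nothing is computed until the loop asks for the next job, so a long or open-ended list of jobs never has to be built in memory up front.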
Use when writing, fixing, or updating a Python script built on the shub_workflow.script base classes (BaseScript / BaseLoopScript / BaseLoopScriptAsyncMixin / ArgumentParserScript) — i.e. any script that operates on Scrapy Cloud: scheduling spiders or scripts, scanning/querying SC jobs, aggregating stats, or running as a crawl manager, monitor, scheduler, consumer/deliverer, or an ad-hoc CLI that talks to SC — whether it runs ON Scrapy Cloud or locally against a project. When asked to create a new "script", first confirm it is a Scrapy Cloud script (see below), since these base classes are the right tool precisely when the script deploys to or operates on Scrapy Cloud.
A set of tools for controlling processing workflows with spiders and scripts running in Scrapinghub ScrapyCloud.
pip install shub-workflow
If you want S3 tools support:
pip install shub-workflow[with-s3-tools]
For Google Cloud Storage tools support:
pip install shub-workflow[with-gcs-tools]
Check the project wiki for documentation. The code tests also contain many usage examples.
shub-workflow ships a Claude Code plugin, shub-workflow-toolkit, that gives Claude working knowledge of shub-workflow tooling. It currently bundles three skills:
- scanjobs job-scanning + plotting tool and its command-line "programs".
- shub_workflow.script base classes (BaseScript / BaseLoopScript / BaseLoopScriptAsyncMixin), i.e. any script that runs on or operates on Scrapy Cloud.
- CrawlManager / PeriodicCrawlManager / GeneratorCrawlManager / AsyncSchedulerCrawlManagerMixin: the set_parameters_gen() pattern, outcome/retry hooks, and async scheduling.
Install it from this repository's plugin marketplace, from inside Claude Code:
/plugin marketplace add scrapinghub/shub-workflow
/plugin install shub-workflow-toolkit@shub-workflow
To enable it automatically for a project, add it to that project's .claude/settings.json:
{
  "enabledPlugins": ["shub-workflow-toolkit@shub-workflow"]
}
The plugin is unversioned (its plugin.json has no version field), so each commit pushed to
this repository is a new version. When Claude Code installs the plugin it copies it into a local
cache (~/.claude/plugins/cache/) and uses that copy — it does not read your working tree or
re-pull from GitHub on every session. You choose how new commits reach you:
Automatic. Turn on auto-update for this marketplace: run /plugin, open the Marketplaces
tab, and enable auto-update for shub-workflow (or set it in settings — see below). With this on,
Claude Code re-pulls the marketplace from GitHub and updates installed plugins at startup, so a
new session always loads the latest pushed commit. This is the low-friction option for staying
current.
{
  "extraKnownMarketplaces": {
    "shub-workflow": {
      "source": { "source": "github", "repo": "scrapinghub/shub-workflow" },
      "autoUpdate": true
    }
  },
  "enabledPlugins": ["shub-workflow-toolkit@shub-workflow"]
}
Manual. Leave auto-update off (the default for third-party marketplaces). The cached copy stays pinned until you explicitly update — nothing changes under you between sessions. To pull the latest when you want it:
/plugin marketplace update shub-workflow # refresh the catalog from GitHub
/plugin update shub-workflow-toolkit@shub-workflow # update the installed plugin
The plugin lives in plugins/shub-workflow-toolkit/; the
marketplace manifest is .claude-plugin/marketplace.json.
The requirements for this library are defined in setup.py as usual. The Pipfile files in the repository don't define dependencies; they are only used for setting up a development environment for shub-workflow library development and testing.
To set up a development environment for shub-workflow, the package comes with Pipfile and Pipfile.lock files. Clone or fork the repository and run:
> pipenv install --dev
> cp pre-commit .git/hooks/
to install the environment, and:
> pipenv shell
to activate it.
There is a script, lint.sh, that you can run whenever you need from the repo root folder. It is also executed on each git commit (provided
you installed the pre-commit hook during the installation step described above). It checks PEP 8 compliance and typing integrity, via flake8 and mypy.
> ./lint.sh
npx claudepluginhub scrapinghub/shub-workflow --plugin shub-workflow-toolkit