Helps build scanjobs command lines for ScrapyCloud jobs: extract data from stats, logs, items, or spider args; post-process and plot results. Covers the `-g`/`-v` program alias system with placeholder escaping.
Slash command: /shub-workflow-toolkit:scanjobs-programs
scanjobs.py scans ScrapyCloud jobs for a spider/script, extracts data from stats, log
lines, items, or spider args via regex, optionally post-processes the extracted numbers
(-c), and optionally plots them (--plot). A project subclasses
shub_workflow.utils.scanjobs.ScanJobs and defines a PROGRAMS dict so long, frequently-used
command lines get a short alias.

    from shub_workflow.utils.scanjobs import ScanJobs as ShubScanJobs


    class ScanJobs(ShubScanJobs):
        PROGRAMS = {
            "response_profile": {
                "description": "Response profile for a spider. Use with -v spider:<spider>",
                "command_line": [
                    "--project-id=<PROJECT_ID>", "{spider}",
                    "-s", "downloader/(response_count)",
                    "--data-headers=auto",
                    "--plot", "title={spider} response profile,no_tiles",
                ],
            },
        }

Invoke: scanjobs.py -g response_profile -v spider:myspider.
Human-readable docs. The shub-workflow wiki has a full prose guide to this tool at https://github.com/scrapinghub/shub-workflow/wiki/ScanJobs. If the user wants a walkthrough, a shareable reference, or asks where the documentation is, point them there.

How -g / -v work (read this before editing PROGRAMS)

PROGRAMS is consumed by ArgumentParserScript.parse_args in shub_workflow.script. The mechanics
that bite people:

- command_line is a list of separate argv tokens, NOT a shell string. A flag and its value
  are two list items: "-s", "downloader/(response_count)", never "-s downloader/...".
- Every token has str.format(**vars) applied, where vars comes from -v k1:v1,k2:v2.
  So "spider:{spider}" becomes "spider:myspider".
- Because .format() runs on every token, any real {/} in a regex or a postscript repeat
  block must be escaped as {{/}}. E.g. a log regex "total": (\d+)} is written
  "total\": (\\d+)}}", and a repeat block is "{{ add }}". Forgetting this raises
  KeyError/ValueError at format time. See examples/annotated-programs.md for worked cases.
- Explicit CLI arguments override the program: -g response_profile -p "5 days" runs the
  program but with --period replaced. Anything left at argparse default in the program is
  overridden by a value you pass explicitly, so a program can hard-code --project-id yet
  still be re-pointed at another project on the CLI.

To list a project's programs: read the PROGRAMS dict in its scripts/scanjobs.py, or run with a
bogus -g (e.g. -g ?); the parser prints * <name>: <description> for each.
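
The token expansion and brace escaping described above can be sketched in plain Python. This is a minimal illustration of the str.format behaviour only, not the actual ArgumentParserScript.parse_args code; the helper names parse_vars and expand are hypothetical, and the token list is illustrative rather than a complete program.

```python
def parse_vars(spec: str) -> dict:
    """Turn a -v style string such as 'k1:v1,k2:v2' into a dict."""
    return dict(item.split(":", 1) for item in spec.split(",") if item)


def expand(command_line: list, variables: dict) -> list:
    """Apply str.format to every argv token (hypothetical helper)."""
    return [token.format(**variables) for token in command_line]


# Illustrative tokens only, not a complete scanjobs program.
program = [
    "{spider}",
    "-l", "\"total\": (\\d+)}}",   # literal } in the regex, escaped as }}
    "-c", "{{ add }}",             # repeat-block braces, escaped as {{ }}
]

expanded = expand(program, parse_vars("spider:myspider"))
print(expanded)

# The same tokens without escaping fail as soon as format() runs.
errors = []
for bad in ["\"total\": (\\d+)}", "{ add }"]:
    try:
        bad.format(spider="myspider")
    except (KeyError, ValueError) as exc:
        errors.append(type(exc).__name__)
print(errors)
```

A lone } is rejected by format() as malformed, while { add } is treated as a lookup of a variable named " add ", which is why the two unescaped tokens fail with different exception types.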

Writing or editing a program:

- Pick the data source:
  - Stats: -s <regex> (matched against stat keys; regex groups + the stat value are emitted)
  - Log lines: -l <regex> (regex groups emitted; add --count to emit 1 per match for counting)
  - Items: -i <jmespath>:<regex>
  - Spider args: -a <arg>:<regex>
- Set --project-id (jobs are read outside ScrapyCloud): a numeric SC project id, or a
  configured alias if the project defines one.
- Write the variable parts as {var} and document them in the
  description ("Use with -v spider:<spider>").
- Post-process with a -c expression where needed; see
  references/postscript.md. Remember to escape repeat braces as {{ }}.
- For plots, set --data-headers (required by --plot) and the --plot string; see
  references/plot-dsl.md. title= is required.
- Match the project's existing programs (-z UTC and
  --period conventions). Run the project's linter after editing.

Run from the project's scripts/ dir as python scanjobs.py -g <program> [-v ...] [overrides].
It must run inside the project's environment (it imports scrapinghub, and for plotting needs
pandas/seaborn/matplotlib) — invoke through whatever the project uses (pipenv, poetry, a venv,
etc.), not a bare system python. Plots are shown interactively or, when no display is available,
saved as <uuid>.png in the cwd. A project often keeps a small shell script cataloging its common
invocations — a useful place to copy real examples from.
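
To make the -s and -l extraction rules concrete, here is a small sketch of what "regex groups + the stat value are emitted" and "--count emits 1 per match" could look like. It is an assumption-laden illustration, not the ScanJobs implementation: scan_stats and scan_log are hypothetical helpers, the use of re.search (rather than a full match) is assumed, and the stats and log lines are made-up sample data.

```python
import re


def scan_stats(stats: dict, pattern: str):
    """Yield (regex groups..., stat value) for each stat key that matches."""
    regex = re.compile(pattern)
    for key, value in stats.items():
        match = regex.search(key)  # assumption: search, not fullmatch
        if match:
            yield (*match.groups(), value)


def scan_log(lines, pattern: str, count: bool = False):
    """Yield the regex groups per matching line, or (1,) when counting."""
    regex = re.compile(pattern)
    for line in lines:
        match = regex.search(line)
        if match:
            yield (1,) if count else match.groups()


# Made-up sample data using standard Scrapy stat names.
stats = {
    "downloader/response_count": 1200,
    "downloader/response_bytes": 48_000_000,
    "item_scraped_count": 950,
}
rows = list(scan_stats(stats, r"downloader/(response_count)"))

log_lines = ["Retrying page 3", "ok", "Retrying page 7"]
matches = list(scan_log(log_lines, r"Retrying page (\d+)"))
counts = list(scan_log(log_lines, r"Retrying page (\d+)", count=True))

print(rows, matches, counts)
```

In a PROGRAMS entry the same patterns would still need any literal braces doubled, since each token passes through str.format before it reaches the parser.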
npx claudepluginhub scrapinghub/shub-workflow --plugin shub-workflow-toolkit