From brainstorm-toolkit
Pattern guide for ingesting external data (web scrapes, third-party APIs, file imports, user-generated content) into a project's database. Covers the three ingestion patterns (discovery pipeline, seed script, direct API), plus how to author a per-source web-discovery skill: WebSearch vs headless browser, the session-cookie pattern for authenticated sites, source trust tiers, and dedup-upsert. Use when adding a new scraper, import script, or automated data-collection job — or "how do I scrape X into the DB?".
Slash command: /brainstorm-toolkit:data-source-pattern
When content/data needs to enter the database from an external source, pick one of three patterns below. Each has a specific fit — don't mix them.
Content that changes, that users filter/search, or that multiple features depend on belongs in the database — not hardcoded into the frontend. The frontend queries via an API; the database is the source of truth.
Use when: data changes over time and needs periodic refresh. Examples: daily deal scanning, event calendars, news feeds, product-catalog syncs.
Shape:
Scheduler (cron / watcher) → queue a job → worker script → fetch/parse →
upsert into DB → emit metric/notification
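The shape above can be sketched in-process as follows. This is a minimal illustration, not the toolkit's worker: the `enqueue` and `run_worker` functions, the `deals-discovery` job type, and the metrics dictionary are hypothetical names, and a standard-library queue stands in for whatever job queue the project actually uses.

```python
import queue

# Hypothetical registry; the source only says a VALID_JOB_TYPES list lives in the worker service.
VALID_JOB_TYPES = {"deals-discovery"}


def enqueue(jobs: queue.Queue, job_type: str, params: dict) -> None:
    """Scheduler side: reject unknown job types before they reach the worker."""
    if job_type not in VALID_JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type}")
    jobs.put({"type": job_type, "params": params})


def run_worker(jobs: queue.Queue, handlers: dict, upsert) -> dict:
    """Worker side: drain the queue, fetch/parse via the handler, upsert rows, return counts."""
    metrics = {"jobs": 0, "rows": 0, "failed": 0}
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break
        metrics["jobs"] += 1
        try:
            # Each handler fetches and parses one source and yields row dicts.
            for row in handlers[job["type"]](job["params"]):
                upsert(row)
                metrics["rows"] += 1
        except Exception:
            # One failing source should not stop the rest of the batch.
            metrics["failed"] += 1
    return metrics
```

The returned counts are what a real worker would emit as its metric or notification at the end of a run.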
Required pieces:
VALID_JOB_TYPES list in your worker service).
scripts/<source>-discovery.py or similar) that:
id, source, external_id, payload, discovered_at, deduped_hash (or similar), and any domain-specific columns.
Gotchas:
JSONB) in addition to extracted columns, so you can re-parse later without re-scraping.
Default to WebSearch + WebFetch: public pages, search engines, RSS, JSON APIs. No browser, runs fully unattended, cheapest. This covers most sources.
Reach for a headless browser (Playwright MCP, or a Playwright/Puppeteer script) only when the source requires it:
WebFetch can't see,
Prefer extracting via page.evaluate(() => …) returning structured data over brittle deep CSS selectors, because selectors rot on every site redesign.
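As a rough sketch of the `page.evaluate` approach, the helper below runs one JavaScript function inside the page and gets back plain dictionaries. It assumes a Playwright `page` object from the Python sync API; the `article`, `h2`, and `a` element names and the `extract_items` helper are hypothetical and would need adapting to the real page.

```python
# JavaScript executed inside the page; it returns JSON-serialisable objects.
# The element names below are placeholders for illustration only.
EXTRACT_JS = """
() => Array.from(document.querySelectorAll('article')).map(el => ({
    title: (el.querySelector('h2')?.textContent || '').trim(),
    url: el.querySelector('a')?.href || null,
}))
"""


def extract_items(page) -> list[dict]:
    """Run the extraction in the browser and keep only complete rows."""
    rows = page.evaluate(EXTRACT_JS)
    return [r for r in rows if r.get("title") and r.get("url")]
```

Keeping the extraction in one function means a site redesign changes a single JavaScript snippet rather than selectors scattered through the script.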
For sites behind a login, don't script the login flow — it's fragile, trips bot detection, and leaks credentials into logs. Instead:
scripts/.<source>-session.json.
Session files are live credentials, so gitignore them. (The toolkit's secret scan is warn-only and won't block a commit that includes one.)
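One way a script might reuse a saved session is sketched below. It assumes the session file uses Playwright's `storage_state` JSON layout (a top-level `cookies` list with `name`, `value`, and `domain` fields); the source does not specify the file format, and `load_session` and `cookie_header` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_session(path: str) -> dict:
    """Read a saved session file; fail loudly instead of falling back to a scripted login."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"{path} not found; log in manually and save the session first")
    return json.loads(p.read_text(encoding="utf-8"))


def cookie_header(state: dict, host: str) -> str:
    """Build a Cookie header from the cookies whose domain matches the host."""
    pairs = []
    for cookie in state.get("cookies", []):
        domain = cookie.get("domain", "").lstrip(".")
        # Match the exact host or any subdomain of the cookie's domain.
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)
```

If the scraper itself runs in Playwright, the same file can be passed to `browser.new_context(storage_state=...)` instead of building headers by hand.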
Scraping pulls in junk unless you rank sources. Bake a tier list into the skill so every run applies it the same way:
.gov/.edu, the data owner's own site, first-party APIs, well-known org domains.
A discovery skill is a thin, repeatable shape, with one skill per source type:
title+location, or source+external_id). Keep the richer duplicate.
INSERT … ON CONFLICT DO UPDATE/NOTHING into the target table; store the raw payload alongside the extracted columns.
Keep DB specifics (driver, connection, table names) out of the skill prose and in the project's existing helpers/conventions. The skill describes the shape; the project supplies the wiring.
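A dedup-upsert keyed on source+external_id might look like the sketch below. SQLite stands in for the project's real database so the example runs anywhere. The `discovered_items` table, the `title` column, and the use of `deduped_hash` as a SHA-256 of the raw payload are assumptions. It simply overwrites with the latest payload rather than choosing the richer duplicate.

```python
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical table; payload is stored as JSON text here (JSONB in Postgres).
SCHEMA = """
CREATE TABLE IF NOT EXISTS discovered_items (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    external_id TEXT NOT NULL,
    title TEXT,
    payload TEXT NOT NULL,
    discovered_at TEXT NOT NULL,
    deduped_hash TEXT NOT NULL,
    UNIQUE (source, external_id)
)
"""

# Update only when the payload actually changed; discovered_at keeps the first-seen time.
UPSERT = """
INSERT INTO discovered_items (source, external_id, title, payload, discovered_at, deduped_hash)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (source, external_id) DO UPDATE SET
    title = excluded.title,
    payload = excluded.payload,
    deduped_hash = excluded.deduped_hash
WHERE excluded.deduped_hash != discovered_items.deduped_hash
"""


def upsert_item(conn: sqlite3.Connection, source: str, external_id: str, raw: dict) -> None:
    """Store the raw payload next to the extracted columns, keyed on (source, external_id)."""
    payload = json.dumps(raw, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(UPSERT, (source, external_id, raw.get("title"), payload, now, digest))
    conn.commit()
```

Because the raw payload is kept, a later change to the extraction logic can re-derive `title` from stored rows without fetching the source again.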
To run discovery on a schedule with no human present, a watcher daemon can drive
the headless claude CLI against a job queue — "Claude as a cron worker." This
is opt-in infrastructure (needs an always-on host); for occasional refresh, just
run the skill by hand. See docs/AUTONOMOUS-DISCOVERY.md in the brainstorm-toolkit
repo for the full watcher pattern, headless-CLI invocation, and security notes.
Use when: you have a static dataset that needs to enter the DB once, not on a schedule. Examples: curated reference data, one-time partner imports, demo content.
Shape:
python3 scripts/seed_<name>.py [--force] [--dry-run]
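A seed script with this command-line shape could be sketched as follows. The `countries` table, the `ROWS` data, and the `seed` and `parse_args` helpers are hypothetical. The sketch uses SQLite and assumes the table already exists (for example, created by the project's migrations).

```python
import argparse
import sqlite3

# Hypothetical reference data; a real seed script would load its own dataset.
ROWS = [("us", "United States"), ("ca", "Canada")]


def seed(conn: sqlite3.Connection, rows, force: bool = False, dry_run: bool = False) -> int:
    """Insert rows idempotently; return how many rows were (or would be) written."""
    existing = {r[0] for r in conn.execute("SELECT code FROM countries")}
    todo = [r for r in rows if force or r[0] not in existing]
    if dry_run:
        # Report only; nothing is written in this branch.
        for code, name in todo:
            print(f"would write {code}: {name}")
        return len(todo)
    if force:
        sql = ("INSERT INTO countries (code, name) VALUES (?, ?) "
               "ON CONFLICT (code) DO UPDATE SET name = excluded.name")
    else:
        sql = "INSERT INTO countries (code, name) VALUES (?, ?) ON CONFLICT DO NOTHING"
    conn.executemany(sql, todo)
    conn.commit()
    return len(todo)


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the two flags from the command-line shape."""
    parser = argparse.ArgumentParser(description="Seed reference data (sketch)")
    parser.add_argument("--force", action="store_true", help="re-insert even if rows exist")
    parser.add_argument("--dry-run", action="store_true", help="print changes without writing")
    return parser.parse_args(argv)
```

A real script would call `parse_args()` and `seed(...)` under the usual `if __name__ == "__main__":` guard, with the connection coming from the project's own DB helper.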
Required pieces:
scripts/ that:
INSERT ... ON CONFLICT DO NOTHING or equivalent idempotency
--dry-run flag that prints what would change without touching the DB.
--force flag (optional) to re-insert even if rows exist.
Gotchas:
Use when: data comes from user actions or an LLM response, synchronously. Examples: user adds a calendar event, LLM generates a plan, chat message, form submission.
Shape:
Frontend/model → POST /api/<module>/<endpoint> → router → service → DB row
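The router-to-service split in this shape can be illustrated without committing to a web framework. In the sketch below the `events` table, the `create_event_route` and `create_event_service` functions, and the validation rules are all hypothetical; a real project would plug the same two layers into its own router and DB helper.

```python
import sqlite3


def create_event_service(conn: sqlite3.Connection, title: str, starts_at: str) -> dict:
    """Service layer: owns the SQL and returns the stored row."""
    cur = conn.execute(
        "INSERT INTO events (title, starts_at) VALUES (?, ?)", (title, starts_at)
    )
    conn.commit()
    return {"id": cur.lastrowid, "title": title, "starts_at": starts_at}


def create_event_route(conn: sqlite3.Connection, body: dict) -> tuple[int, dict]:
    """Router layer: validate the request body, then delegate to the service."""
    title = str(body.get("title", "")).strip()
    starts_at = str(body.get("starts_at", "")).strip()
    if not title or not starts_at:
        # Reject incomplete input before it reaches the database.
        return 422, {"error": "title and starts_at are required"}
    return 201, create_event_service(conn, title, starts_at)
```

Validating in the router matters for both callers in this pattern, since an LLM response can be as malformed as a user-submitted form.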
Required pieces:
Gotchas:
| Scenario | Pattern |
|---|---|
| Scraping a website every day for deals | 1 (discovery) |
| Loading a curated starter dataset | 2 (seed) |
| User creating a row via the UI | 3 (direct) |
| LLM-generated content from a chat flow | 3 (direct) |
| Periodic refresh from a public API | 1 (discovery) |
| One-time migration of legacy data | 2 (seed) |
Read the project's existing examples:
scripts/*-discovery.py, scripts/seed_*.py, or scripts/scrape-*.py is a good template.
CLAUDE.md for project-specific conventions (worker framework, DB connection helper, logging conventions).
npx claudepluginhub exerias21/brainstorm-toolkit --plugin brainstorm-toolkit