Convert a planned (or existing-but-undocumented) ingestion path into an Ingestion Plan markdown file under docs/ingestion-plans/, enforcing the Stage 2 kill-switch — an accepted plan must name at least one upstream Source Profile. Use when the user says "/plan-ingestion", "plan this ingestion", "design this pipeline", "document this pipeline", "spec this ingestion", or describes a pipeline ("we're about to ingest X", "we need to land this stream", "what's our contract for the Stripe pull", "we never wrote down how this pipeline replays") and wants the operational and contractual model captured before code is written or after the pipeline is built but undocumented. Streaming and batch are equally first-class; the interview branches on the user's answer rather than defaulting either way. Routes to /write-adr when the decision is a cross-cutting platform commitment rather than a per-pipeline plan.
Slash command: /relentless-data-skills:plan-ingestion
This skill runs a short, branching interview that converts a planned (or existing-but-undocumented) ingestion path into an Ingestion Plan markdown file at docs/ingestion-plans/<pipeline>.md in the consumer's project. It is an interview, not a template — every plan that lands has been pressure-tested against the Ingestion kill-switch (a named upstream Source Profile), the delivery-mode comparison against that profile, the streaming/batch neutrality rule, the honest-replay test, the honest-quarantine test, and the "is this actually an ADR moment?" routing test.
This skill is self-contained: the Ingestion Plan contract it writes against ships beside it at references/ingestion-plan.md, and the linter that enforces that contract ships at scripts/lint-ingestion-plan.sh. Read the contract for the frontmatter fields, enum domains, and body section order before running — do not duplicate that material into the interview; reference it.
Optional deeper reading — not required to run this skill (a per-skill install works without them): the full repo worldview (including the CONTEXT.md entries "Ingestion Plan" and "Ingestion kill-switch") in CONTEXT.md and the lifecycle reference in docs/data-engineering-101.md.
An Ingestion Plan is not a project-management plan — not a TODO list with owners and dates, not a delivery roadmap. It is the operational and contractual model of the ingestion path (idempotency, replay window, quarantine, schema evolution, pager), the structural analogue of the Source Profile one stage downstream, deliberately human-authored and never inferred from a code artefact (no dbt manifest scan, no Airflow DAG parse, no Terraform read).
The single load-bearing question this skill exists to force is:
Which Source Profile(s) under docs/sources-of-record/ does this ingestion consume?
If the user cannot point to at least one accepted-or-draft Source Profile in related_sources_of_record, the ingestion does not deserve a plan. An Ingestion Plan with no named upstream source is the structural twin of a Source Profile with no consumer DGQ, and of a DGQ with action_change: nothing — documentation that floats free of the lineage graph, the failure mode this stage exists to refuse. Surface that conclusion explicitly. Recommend the user run /profile-source first; come back to /plan-ingestion once the upstream is profiled.
Do not soften the kill-switch. Do not paper over the absence by inventing a plausible-sounding Source Profile path on the user's behalf. The kill-switch (Turn 1) is the one place in this skill where the user must speak first; a recommendation there would coach them past the very gap the skill exists to surface.
The linter enforces the structural contract — an accepted plan with an empty related_sources_of_record fails scripts/lint-ingestion-plan.sh. This skill is the upstream half: it refuses to start the interview without one named upstream Source Profile.
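The structural rule above can be sketched in a few lines. This is a minimal Python illustration, not the bundled linter (which is a shell script whose internals are not shown here); the function name and the example path are hypothetical, and the frontmatter is assumed to be already parsed into a dict.

```python
def violates_kill_switch(frontmatter: dict) -> bool:
    """Return True when an accepted plan names no upstream Source Profile."""
    status = frontmatter.get("status")
    upstream = frontmatter.get("related_sources_of_record") or []
    return status == "accepted" and len(upstream) == 0


accepted_without_upstream = {"status": "accepted", "related_sources_of_record": []}
draft_without_upstream = {"status": "draft", "related_sources_of_record": []}
accepted_with_upstream = {
    "status": "accepted",
    # Hypothetical path, for illustration only.
    "related_sources_of_record": ["docs/sources-of-record/example-source.md"],
}

for plan in (accepted_without_upstream, draft_without_upstream, accepted_with_upstream):
    print(plan["status"], violates_kill_switch(plan))
```

Only the accepted plan with an empty list trips the check. The skill itself is stricter than this structural rule, since it will not begin the interview without a named upstream.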
One Ingestion Plan per pipeline, at docs/ingestion-plans/<pipeline>.md. If the user is planning several pipelines around one source (a backfill plus a CDC tailer, an hourly pull plus a webhook reconciler), write one fully-drilled plan first and queue the rest as status: draft so a later run can pick them up.
Before starting a fresh interview, scan docs/ingestion-plans/ in the consumer's project. If one or more Ingestion Plan files exist, pause and ask:
I see existing Ingestion Plans in docs/ingestion-plans/. Do you want to:
- Continue an existing draft or needs-follow-up plan (resumable)
- Start a new plan
- List all plans and their statuses
Treat status: accepted, status: rejected, and status: superseded-by-<path> files as immutable from this skill's perspective — do not offer to edit them. If the user picks "continue", load the chosen file and resume from the first unfilled section (commonly the section that corresponds to the next unanswered turn). If they pick "list", print one line per file: <path> — <status> — <pipeline> → <destination>.
If docs/ingestion-plans/ does not exist, create it on first write — not before.
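The resume and list behaviour described above could look roughly like this. It is a sketch under stated assumptions: the helper names are hypothetical, and the status has already been read out of each plan's frontmatter.

```python
RESUMABLE_STATUSES = {"draft", "needs-follow-up"}


def is_resumable(status: str) -> bool:
    """Only draft and needs-follow-up plans may be continued by the skill."""
    return status in RESUMABLE_STATUSES


def listing_line(path: str, status: str, pipeline: str, destination: str) -> str:
    """Render one plan as a single listing line."""
    # \u2014 is the dash and \u2192 the arrow used in the listing format above.
    return f"{path} \u2014 {status} \u2014 {pipeline} \u2192 {destination}"


print(is_resumable("draft"))
print(is_resumable("accepted"))
print(listing_line(
    "docs/ingestion-plans/orders-into-lake.md",
    "draft",
    "orders-into-lake",
    "s3://lake/raw/orders/",
))
```

Accepted, rejected, and superseded-by-<path> statuses all fall outside the resumable set, so the skill never offers to edit them.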
Carry out the nine turns below in order. One question per turn, not a wall of multiple questions, except where the spec explicitly says "combined turn" (Turn 8). If the user volunteers material that answers a later turn early, accept it and skip ahead — do not re-ask. Do not deliver multi-paragraph monologues; keep each prompt tight.
Two cross-cutting rules apply to every turn:
Source Profiles in docs/sources-of-record/, DGQs in docs/requirements/, ADRs in docs/adr/, adjacent Ingestion Plans in docs/ingestion-plans/, glossary entries in CONTEXT.md: look first, then bring what you found into the turn. In particular, always read the named upstream Source Profile once it is identified in Turn 1; its delivery_mode, cadence, late_arrival_window, and operational-contract section seed Turns 3–7 directly. Do not make the user re-derive what the repo already knows. If a stakeholder term collides with CONTEXT.md's glossary, call it out the moment you notice it.
Ask, verbatim or close:
Which Source Profile(s) under docs/sources-of-record/ does this ingestion consume? Point me at one or more by filename or by system/entity.
Before asking, scan docs/sources-of-record/ and list the available Source Profile files with their system/entity line and status — so the user is choosing from a concrete set, not naming a file from memory. If docs/sources-of-record/ is empty or absent, that is itself the answer: there is no profiled upstream; the ingestion does not deserve a plan yet.
Three branches:
One or more named upstream Source Profiles. The user points to one or more files with status: accepted or status: draft. Record the paths; they go into related_sources_of_record. Read each one before continuing — its delivery_mode, cadence, late_arrival_window, and operational contract seed the next several turns. Continue to Turn 2.
A profile exists but it is rejected or superseded-by-<path>. Push back: "That profile was rejected — what's the live upstream for this ingestion?" / "That profile was superseded by <path> — should we plan against the superseding one instead?" Do not start the plan against a non-live profile; the upstream link must be a source someone has actually accepted as the contract.
No Source Profile exists for the upstream, or the user cannot name one. Trigger the kill-switch. Do not write a plan. Tell the user:
There's no named upstream Source Profile for this ingestion. An Ingestion Plan without a profiled source is documentation that floats free of the lineage graph, exactly the failure mode this stage refuses. I recommend running /profile-source first to characterise the upstream, then coming back to /plan-ingestion once at least one Source Profile under docs/sources-of-record/ is in place.
Then exit cleanly. Do not start drafting a plan "just in case." The kill-switch is the differentiator; surface it without apology.
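The three Turn 1 branches reduce to a small decision over the statuses of the profiles the user named. The sketch below is illustrative only: the function name and branch labels are hypothetical, and it assumes that any non-live profile in the answer triggers the push-back branch.

```python
LIVE_PROFILE_STATUSES = {"accepted", "draft"}


def turn_one_branch(named_profile_statuses: list) -> str:
    """Pick the Turn 1 branch from the statuses of the named Source Profiles."""
    if not named_profile_statuses:
        # No profile named at all: the kill-switch fires and no plan is written.
        return "kill-switch"
    if any(s not in LIVE_PROFILE_STATUSES for s in named_profile_statuses):
        # A rejected or superseded profile was named: ask for the live upstream.
        return "push-back"
    return "continue"


print(turn_one_branch([]))
print(turn_one_branch(["rejected"]))
print(turn_one_branch(["accepted", "draft"]))
```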
Ask:
What's the pipeline (pipeline slug and destination), and is this one pipeline or several?
Capture two fields:
- pipeline: kebab-case slug uniquely naming the pipeline (stripe-charges-into-raw, salesforce-account-cdc, orders-into-lake, partner-fulfilment-nightly). Aim for a name that reads as "what flows where," not a code-repo name.
- destination: kebab-case slug or path naming where landed raw data lives (raw.stripe_charges, s3://lake/raw/orders/, kafka:raw.salesforce.account). This is the shape that Stage 4 (Transformation) will read from.
Listen for plural pipelines hiding inside one ask. "Plan the Salesforce ingestion" often unpacks into several plans: salesforce-account-backfill (one-off pull) plus salesforce-account-cdc (continuous tail) plus salesforce-account-nightly-reconcile (drift catch). Each is a separate Ingestion Plan; they have genuinely different replay strategies, schema-evolution stances, and pagers.
Branching rule. If more than one pipeline surfaces:
- For each additional pipeline, write a stub at docs/ingestion-plans/<pipeline>.md with status: draft, frontmatter populated with the upstream Source Profile(s) from Turn 1, the pipeline/destination identifying the stub, all other linter-required enums set to safe placeholders (delivery_mode: pull-batch, cadence: tbd (or ad-hoc if the cadence enum is not free-form in the project), sensitivity: internal, ordering: ingest-time, dedup_key: not-applicable, schema_evolution: none, replay_strategy: not-replayable, quarantine: none), and a note in ## Open follow-ups that this plan needs resumption. Body sections are present but with TBD bullets, so the linter passes.
Ask:
The named Source Profile records delivery_mode: <profile's delivery_mode>. Is this ingestion consuming it as <that>, or transforming it into something else (e.g., a cdc source pulled into pull-batch snapshots, or a streaming source webhook-bridged)?
Map the answer to delivery_mode: one of cdc | pull-batch | push-batch | streaming | webhook | manual-export. The enum is deliberately identical to the Source Profile's. Do not default. Streaming and batch are both first-class in this repo; the framing must come from the user's answer, not from a built-in lean. If the Ingestion Plan's delivery_mode disagrees with the Source Profile's, that disagreement is allowed but must be recorded in the body's ## Delivery model section as a deliberate architectural choice with a one-line rationale — or interrogated as a smell if the user cannot name a reason.
Branch the follow-up questions on the user's answer. Ask only the relevant sub-questions; don't ask the streaming questions of a batch ingestion and vice versa. (Turn 4 captures these in detail; Turn 3's job is just to nail down delivery_mode and surface the comparison.)
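The comparison between the profile's delivery_mode and the plan's can be pictured as follows. This is a sketch, not part of the skill: the function name and the three outcome labels are hypothetical, while the enum values are the ones listed above.

```python
DELIVERY_MODES = {
    "cdc", "pull-batch", "push-batch", "streaming", "webhook", "manual-export",
}


def compare_delivery_mode(profile_mode: str, plan_mode: str, rationale: str = "") -> str:
    """Classify how the plan's delivery_mode relates to the Source Profile's."""
    for mode in (profile_mode, plan_mode):
        if mode not in DELIVERY_MODES:
            raise ValueError(f"unknown delivery_mode: {mode!r}")
    if profile_mode == plan_mode:
        return "consistent"
    # A mismatch is allowed, but only with a written one-line rationale.
    if rationale.strip():
        return "deliberate-divergence"
    return "smell"


print(compare_delivery_mode("cdc", "cdc"))
print(compare_delivery_mode("cdc", "pull-batch", "nightly snapshots are enough for finance"))
print(compare_delivery_mode("streaming", "webhook"))
```

There is deliberately no default argument for either mode, which mirrors the rule that the framing must come from the user's answer.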
Ask, branched:
Streaming / CDC / webhook:
What's the ordering guarantee — per-partition, global, none? What's the partition key, and is it the natural key for the entity? How do you identify duplicates — message id, composite key, content hash, not at all?
Map answers to:
- ordering: one of source-order | event-time | ingest-time | unordered.
- dedup_key: one or more fields, OR the literal not-applicable.
Pull-batch / push-batch / manual-export:
What does idempotency on re-run look like — if the pipeline reruns the same window, does it produce the same result? What's the dedup key, and is the answer "we accept duplicates" honest here?
Map answers to:
- ordering: most commonly ingest-time for batches, but ask; event-time is a real answer when the source emits an event timestamp the team trusts.
- dedup_key: one or more fields, OR the literal not-applicable.
Both branches: the honest answer "we do not have a dedup key and we accept duplicates" is dedup_key: not-applicable plus a body rationale in ## Failure & replay (or ## Delivery model if the source genuinely cannot produce duplicates). It is a written-down decision, not a soft-skip; the linter accepts it as a deliberate non-empty value. Do not record not-applicable to bypass the question.
Record the answers in the body's ## Delivery model section as a short bulleted list — one bullet per sub-question, not prose. Streaming plans and batch plans end up the same shape: a list of named contracts, not a wall of text.
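To make the idempotency question concrete, here is a small sketch of what "re-running the same window produces the same result" means when a dedup key exists. The record fields (charge_id, amount) and the in-memory dict standing in for the landed table are assumptions for illustration only.

```python
def land(existing: dict, batch: list, dedup_key: tuple) -> dict:
    """Upsert a batch into the landed set, keyed on the dedup key fields."""
    landed = dict(existing)
    for record in batch:
        key = tuple(record[field] for field in dedup_key)
        landed[key] = record  # a re-delivered record overwrites itself
    return landed


window = [
    {"charge_id": "ch_1", "amount": 500},
    {"charge_id": "ch_2", "amount": 700},
]

first_run = land({}, window, ("charge_id",))
rerun = land(first_run, window, ("charge_id",))

print(len(first_run), len(rerun))
print(first_run == rerun)
```

With dedup_key: not-applicable there is no key to upsert on, so a rerun of the same window would append the rows again. That is why the plan has to say in writing that duplicates are accepted.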
Ask:
When the source adds a column tomorrow, what happens? When a type widens (int → bigint, varchar(50) → varchar(255))? When a column is renamed?
Map the answer to schema_evolution: one of contract-versioned | additive-only | breaking-allowed | none.
- contract-versioned: there is a written contract (schema registry, dual-signed data contract) and version bumps are negotiated. Aspirational for most teams; verify before recording.
- additive-only: the pipeline tolerates new columns (typically by listing them in the landed table or wrapping records in an envelope) but breaks on type changes or renames. The most common honest answer when there is a stance.
- breaking-allowed: the team genuinely accepts breakage as a forcing function. Deliberate, not lazy. Pipeline halts on schema drift; humans fix it before re-running.
- none: there is no policy. Schema drift produces silent coercion, dropped columns, or pipeline errors at random. This is the honest answer when the team has not thought about it; it must read as a deliberate (and somewhat alarming) choice, not a default. Body note required.
Also capture the landing shape (column-for-column passthrough, envelope wrapper with source-emit timestamp and idempotency key, Avro/Parquet/JSON, retention on the landed-raw layer) in the body's ## Landing contract section. Stage 4 reads from here; what this section commits is what Stage 4 is allowed to assume.
Push the user toward honesty. The temptation is to write additive-only because it sounds reasonable. If schema changes have never been tested, the body should say so plainly: "additive-only in design; never exercised in production."
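One way to exercise an additive-only stance before recording it is to diff two schema snapshots and see which changes would break the pipeline. The sketch below assumes schemas are plain column-to-type mappings; the function name and the column names are hypothetical.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """List changes an additive-only pipeline would not tolerate."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            # A rename looks like a removal plus an unrelated new column.
            problems.append(f"{column}: removed or renamed")
        elif new[column] != old_type:
            problems.append(f"{column}: type changed {old_type} -> {new[column]}")
    return problems  # columns that only exist in `new` are additive and fine


old = {"id": "int", "email": "varchar(50)"}

added = {"id": "int", "email": "varchar(50)", "country": "varchar(2)"}
widened = {"id": "bigint", "email": "varchar(50)"}
renamed = {"id": "int", "email_address": "varchar(50)"}

print(breaking_changes(old, added))
print(breaking_changes(old, widened))
print(breaking_changes(old, renamed))
```

Only the added column passes cleanly; the widened type and the rename are both reported, which matches the three questions the turn asks.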
This is the turn where template-mode regression is most likely. The skill is explicit and slow here.
Ask, in two passes:
Pass A (replay):
What does re-running this pipeline produce? If the upstream replayed the last hour after an outage, would your landed data double-count, deduplicate cleanly, or stay broken? Is replay source-replay (you can re-fetch from upstream), warehouse-rewind (you have a durable raw landing you can re-read), or not-replayable (and you accept that)?
Map the answer to replay_strategy: one of source-replay | warehouse-rewind | not-replayable. Three rules:
- source-replay only if the team has actually exercised it. If not, the honest answer today is untested, which is not an enum value: record replay_strategy: not-replayable (or source-replay only if the user accepts the aspirational framing and notes the gap in the body) and add a ## Open follow-ups line: "Replay never exercised: run a drill and confirm."
- If the user does not know the answer, set status: needs-follow-up, name the specific outstanding question ("what does re-running the last hour produce?"), stop the interview, and prompt the user to go and find out. Replay-by-hope is the Stage 2 failure the kill-switch in spirit guards against, even though the structural kill-switch is upstream-chain.
- not-replayable is a valid honest answer: many real pipelines genuinely are not replayable, and pretending otherwise is the failure mode. If not-replayable, the body must name the compensating procedure (a backfill script, a partner export, a manual reconciliation), not just leave it implicit.
Pass B (quarantine):
What happens to malformed records: dead-letter topic, quarantine table, drop with alert, drop silently? Is there a quarantine path, or is the honest answer quarantine: none?
Map the answer to quarantine: a path/topic/table name, OR the literal none. Two distinctions to surface explicitly:
- quarantine: <path> means there is a real destination for bad rows: a _quarantine table, a dead-letter Kafka topic, an S3 prefix the team monitors. Record the path verbatim.
- quarantine: none is a deliberate choice to accept losing malformed records, with a recorded rationale in ## Failure & replay. It is not the same as "we haven't built a quarantine path yet and we drop bad rows silently." If the answer is the latter, the right state is status: needs-follow-up with quarantine: none (provisional) plus an ## Open follow-ups line: "Quarantine path not designed: decide deliberate-none or build one."
The skill's stance: "Drop bad rows silently" is the most common Stage 2 failure mode in the wild, and the skill exists to make it impossible to enter by inattention. quarantine: none must be a written-down decision, not a default.
Record both answers in the body's ## Failure & replay section. This section is the single most load-bearing piece of body prose in the entire plan; do not let it be one line. Include the dedup rationale from Turn 4 if it lands here.
Ask:
When the pipeline breaks, who is paged, what's the recovery procedure, where does the runbook live, what's the freshness / lateness SLA, and what's the dependency on the upstream Source Profile's operational contract?
Five facts to capture in the body's ## Operational contract section:
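- Who is paged when the pipeline breaks.
- The recovery procedure.
- Where the runbook lives.
- The freshness / lateness SLA.
- The dependency on the upstream Source Profile's operational contract.
Record anything the user cannot answer in ## Open follow-ups.
Also capture cadence (continuous | hourly | daily | weekly | ad-hoc | one-off; free-form, but stick to these forms when applicable). Often derivable from the delivery model and the upstream profile's cadence; ask explicitly only if not.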
In one turn, ask the user to answer four sub-questions. Combine them — do not break this into four separate exchanges.
Quick undercurrent sweep:
- Sensitivity: is the underlying data public, internal, confidential, or regulated? (Often inherits from the upstream Source Profile; flag if it doesn't.)
- Governance: any PII, residency, or retention obligations worth recording? (e.g., "GDPR-scoped: emails and IPs land in raw", "must stay in EU region", "30-day retention on landed-raw per contract")
- Cost envelope: what is the rough scale class? Small (a handful of rows / a single API call per run), medium (recurring scheduled pull, single-digit GB per run), or large (continuous stream, terabyte scan, expensive partner API).
- Orchestration: what runs this? Scheduler name (Airflow DAG, Dagster job, cron, Kafka consumer group, vendor-managed). If there is no orchestrator yet, "none: designed but not deployed" is the honest answer.
Map sensitivity into the corresponding frontmatter field. Governance, cost envelope, and orchestration live in the body — under ## Operational contract (orchestration, pager-adjacent) or ## Landing contract (governance, retention-adjacent) as appropriate — not as separate frontmatter fields. If the user says "don't know" on any, mark that with a short placeholder (unknown, tbd) and add a line to ## Open follow-ups. Do not infer values silently.
Pick a filename: <pipeline>.md under docs/ingestion-plans/, the slug kebab-case throughout (lowercase letters, digits, hyphens; single-word slugs are technically valid but at least one hyphen is recommended for readability).
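The slug rule can be checked mechanically before the file is written. The pattern below is a sketch and slightly stricter than the stated rule: it assumes hyphens may only separate non-empty segments, so leading, trailing, and doubled hyphens are rejected. The helper names are hypothetical.

```python
import re

# Lowercase letters and digits, in segments separated by single hyphens.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """True when the whole string is a kebab-case slug."""
    return SLUG_PATTERN.fullmatch(slug) is not None


def plan_path(pipeline: str) -> str:
    """Build the plan's path under docs/ingestion-plans/ from the pipeline slug."""
    if not is_valid_slug(pipeline):
        raise ValueError(f"not a kebab-case slug: {pipeline!r}")
    return f"docs/ingestion-plans/{pipeline}.md"


print(is_valid_slug("stripe-charges-into-raw"))
print(is_valid_slug("Stripe_Charges"))
print(plan_path("salesforce-account-cdc"))
```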
Write the file shaped exactly as the contract in references/ingestion-plan.md specifies — the fourteen frontmatter fields and the six body sections in order. The earlier turns already gathered the content each field and section needs: the upstream Source Profile(s) from Turn 1, pipeline/destination from Turn 2, delivery_mode from Turn 3, ordering and dedup_key from Turn 4, schema_evolution and the landing shape from Turn 5, replay_strategy and quarantine from Turn 6, the operational contract and cadence from Turn 7, and sensitivity plus the governance/orchestration notes from Turn 8. Read the contract for the exact field list, enum domains, and section order rather than reproducing them here.
related_sources_of_record carries the Source Profile path(s) from Turn 1; the linter requires at least one entry for status: accepted. related_dgqs and related_adrs stay [] at first write — populate them later only if cross-links accrue (the upstream profile names a consumer DGQ worth recording transitively, or an ADR-moment surfaces and the user agrees to run /write-adr). Do not invent paths.
After writing, print a one-line confirmation:
Wrote docs/ingestion-plans/<pipeline>.md
status: <status>
No body summary echoed back — the user just answered the questions; reading it back is noise. Then surface the next queued plan (if Turn 2 created stubs): "You have N draft plans queued from this session — want to drill the next one now?" If the user declines, exit cleanly.
Run the bundled structural linter scripts/lint-ingestion-plan.sh — it sits beside this SKILL.md, in the skill folder's own scripts/ — as a cheap sanity check after writing. Resolve it relative to this skill folder and invoke it from the consumer project root so it picks up docs/ingestion-plans/ (or point it elsewhere with INGESTION_PLANS_ROOT=<dir>). It is best-effort: if the script is present in this install, run it; if it is absent — a partial install that dropped the skill's scripts/, say — skip it silently and rely on the consumer's pre-commit hook. It is not a CI gate — just a fast structural check that frontmatter, enums, body sections, and the kill-switch are intact. If the schema enforced by the linter drifts from this prompt, the linter wins — record a fix-up issue and ship the plan that passes the linter.
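The best-effort behaviour described above might be wrapped like this. It is a sketch under assumptions: the wrapper name is hypothetical, and it assumes the script runs under bash with no arguments, which the text does not confirm. The script location and the INGESTION_PLANS_ROOT variable come from the paragraph above.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional


def run_linter_if_present(
    skill_dir: Path, project_root: Path, plans_root: Optional[str] = None
) -> Optional[int]:
    """Run the bundled linter from the project root; return None when it is absent."""
    script = skill_dir / "scripts" / "lint-ingestion-plan.sh"
    if not script.is_file():
        return None  # partial install: skip silently
    env = dict(os.environ)
    if plans_root is not None:
        env["INGESTION_PLANS_ROOT"] = plans_root
    # Assumption: the script takes no arguments and is invoked via bash.
    result = subprocess.run(["bash", str(script)], cwd=project_root, env=env)
    return result.returncode


# With no skill folder at this (hypothetical) path, the check is skipped.
print(run_linter_if_present(Path("/nonexistent-skill-dir"), Path(".")))
```

A None return means the check was skipped, and any integer is the linter's own exit code.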
At any point in the interview, if the decision being recorded is a cross-cutting platform commitment rather than a per-pipeline plan, pause and route. Signals that you are looking at an ADR-moment:
The user says something that applies across pipelines, not just this one. Example phrasings to listen for:
"We always quarantine to the same dead-letter topic; it's our pattern." "All batch pipelines land in raw_ schemas first, then promote." "Every CDC source uses Debezium; we don't write custom CDC any more." "Idempotency key is always source_id || source_emit_ts for streaming, always the source primary key for batch."
These are platform commitments, not per-pipeline facts.
The decision is about a default delivery mode across a class of sources ("streaming for events, batch for application DBs").
The decision is about a default schema-evolution stance across pipelines ("additive-only by default; breaking-allowed requires a one-pager").
The decision is about a default replay posture across pipelines ("raw landing is the rewind point; we don't rely on upstream replay").
The decision applies across many pipelines, teams, or storage targets rather than to this one pipeline.
When you spot one of these, stop the plan interview and say something like:
This isn't a per-pipeline plan; it's a cross-cutting platform commitment. The right artefact is an ADR, not an Ingestion Plan. I recommend running /write-adr separately to capture this decision in docs/adr/. After the ADR lands, you can come back and link it from any related Ingestion Plans via the related_adrs frontmatter, and the per-pipeline plan will just reference the ADR rather than restate it.
Do not inline the ADR scaffold here. One skill, one artefact — the ADR belongs to /write-adr. If the user insists on continuing the plan, write it but note in ## Open follow-ups that the recorded pattern likely should be promoted to an ADR.
- Keep ADR work out of scope: "/write-adr for <topic> when you're ready; it's not part of this plan's scope."
- If the plan lands as needs-follow-up, name the specific question the user needs to take to source owners, platform team, or stakeholders. Vague follow-ups rot; named ones get answered.
- If the plan lands as rejected (the ingestion was planned defensively and the user decided the upstream consumer chain doesn't justify it after all), leave it in place with the rejection rationale in ## Open follow-ups. The rejection is a durable record of the deliberate decision not to build, not a TODO to revisit.
npx claudepluginhub sleeplessv/relentless-data-skills-de --plugin relentless-data-skills