From posthog
Guides connecting external data sources (Postgres, Stripe, etc.) to PostHog warehouse. Covers source discovery, credential validation, table selection, sync type configuration, and creation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/posthog:setting-up-a-data-warehouse-sourceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Use this skill when the user wants to connect an external data source to PostHog's data warehouse for the first time.
Use this skill when the user wants to connect an external data source to PostHog's data warehouse for the first time.
Default to the one-step flow: data-warehouse-source-setup validates credentials, discovers every table, enables
them with sensible sync defaults (incremental where possible), and creates the source in a single call — no
schemas array to assemble. For credentials, hand the user a secure browser link with data-warehouse-source-connect-link
instead of collecting secrets in chat. Only drop to the manual wizard → db-schema → create flow when the user wants to
hand-pick which tables sync or set non-default sync types per table.
schemas payload| Tool | Purpose |
|---|---|
data-warehouse-source-connect-link | Preferred for credentials — get a secure browser/OAuth link so the user authenticates without pasting secrets in chat |
data-warehouse-source-setup | Preferred to create — one call: validate creds, discover tables, apply sync defaults, create the source |
external-data-sources-wizard | Discover which source types exist and what fields each needs (advanced flow) |
external-data-sources-db-schema | Validate credentials and list tables with available sync methods per table (advanced flow) |
external-data-sources-create | Advanced create — requires a schemas array built from the db-schema response |
external-data-sources-check-cdc-prerequisites-create | Postgres CDC pre-flight check (optional, only for Postgres CDC) |
external-data-sources-webhook-info-retrieve | Check if a source supports webhooks and whether one has been registered |
external-data-sources-create-webhook-create | Register a webhook with the external service after source creation |
external-data-sources-update-webhook-inputs-create | Supply the signing secret manually when auto-registration failed |
external-data-sources-list | After creation, confirm the source is listed and see its initial status |
external-data-schemas-list | See per-table sync status once the source is created |
Most setups should use this path — it avoids the most common failures (skipping db-schema, malformed schemas,
missing required fields).
Discover the source type and its fields (optional): external-data-sources-wizard lists every source type and
the credential fields each needs. Use it to know what to ask the user for; skip it if the source type is obvious.
Collect credentials securely: call data-warehouse-source-connect-link({source_type}). It returns a
connect_url to a minimal connect page rendering the source's full connection form — the user authorizes via
OAuth or enters credentials there, whichever the source offers (the response's auth_method tells you which to
expect). The page validates the details against a live connection and stashes them encrypted in a temporary
store — it does NOT create the source. After the user confirms they're done, find the stored credential id via
data-warehouse-stored-credentials-list (filter by source_type, newest first; the page also shows the id to
the user) and pass {"credential_id": <id>} to setup. Stored credentials are single-use — deleted as soon as
setup consumes them — and expire after 24 hours.
Never ask the user to paste raw database passwords, API keys, or OAuth tokens into the chat.
Create in one call: data-warehouse-source-setup({source_type, payload, prefix}). The server validates
credentials, discovers all tables, enables them with sync defaults (incremental where a tracking column exists,
else append, else full_refresh — never CDC), sets created_via=mcp, and creates the source. The payload carries
a credential reference ({"credential_id": ...} or the OAuth integration id key) — or inline credentials for
headless automation; no schemas array is needed. On success you get the new source id; call
external-data-schemas-list to show the user what was enabled and how each table will sync.
Notes specific to this path:
webhook key in the
setup response: on success, webhook-capable tables sync in real time and webhook-only tables (e.g. Stripe Discount)
are enabled too; on failure (e.g. the API key can't create webhooks), tables keep the polling defaults and
webhook-only tables stay disabled — relay the webhook.error to the user and offer Step 6 to register manually.
If webhook.pending_inputs is non-empty, collect those values and submit via
external-data-sources-update-webhook-inputs-create. CDC is never chosen automatically; use the advanced flow +
CDC steps for near-real-time Postgres.payload still work for headless/automation, but prefer the connect-link handoff above.Use this when the user wants to choose exactly which tables sync or set non-default sync types. Don't try to shortcut
to external-data-sources-create — you need the db-schema response to build a valid schemas payload.
┌────────────────────┐
│ 1. wizard │ What source types exist? What fields does each need?
└────────┬───────────┘
▼
┌────────────────────┐
│ 2. db-schema │ Validate creds. List tables + available sync methods per table.
└────────┬───────────┘
▼
┌────────────────────┐
│ 3. create │ Send source_type + credentials + schemas[] to actually create.
└────────────────────┘
Call external-data-sources-wizard (no params). The response is a dict keyed by source type. Each entry describes:
name — the canonical source_type string you'll pass to later calls (e.g. "Postgres", "Stripe", "Hubspot").label / caption — human-readable.fields — the config fields needed (host, port, database, api_key, client_id/secret, ...). Each has name,
type (input, password, switch, select, file-upload), and required.featured, unreleasedSource — use to gauge readiness. Skip sources marked unreleasedSource: true unless the
user explicitly asked for a preview.Match the user's request to a source. If they said "Postgres", look up Postgres. If they said something ambiguous
like "database", present the top relevant matches (Postgres, MySQL, MongoDB, BigQuery, Snowflake, Redshift) and let
them pick.
For OAuth-based sources (Hubspot, Salesforce, Google Ads), the wizard entry hints at an OAuth flow. These typically need the user to authorize in the PostHog UI rather than pasting credentials — explain this and direct them to the source setup page rather than trying to collect tokens in chat. OAuth is about authentication, not about how data flows; OAuth sources still use polling bulk sync, not webhooks.
Gather the required credentials from the user. Never ask for more fields than the wizard entry says are required —
asking for an unnecessary port when the source doesn't need one confuses users.
Call external-data-sources-db-schema with source_type plus all credential fields. This does two things at once:
message if anything is wrong (bad host,
wrong password, permission denied). Show the error verbatim — it's often actionable ("password authentication
failed for user 'x'").{
"table": "orders",
"should_sync": false,
"rows": 1_250_000,
"incremental_available": true, # can do sync_type=incremental
"append_available": true, # can do sync_type=append
"cdc_available": true, # can do sync_type=cdc (null = not enabled for team)
"supports_webhooks": false, # can do sync_type=webhook for real-time push
"incremental_fields": [ # candidates: usually updated_at, created_at, id
{"field": "updated_at", "type": "datetime", "label": "updated_at", ...},
{"field": "created_at", "type": "datetime", ...},
{"field": "id", "type": "integer", ...}
],
"detected_primary_keys": ["id"],
"available_columns": [{"field": "id", "type": "integer", "nullable": false}, ...],
"description": "..."
}
Present this to the user. Don't dump the raw JSON — summarize: which tables were found, row counts, and the default sync method recommendation per table (see sync-type decision guide).
For each table the user wants to sync, pick a sync_type. See the sync-type decision guide for detailed rules, but the short version is:
full_refresh — simple and always correct.updated_at / modified_at: incremental — much cheaper per sync.append if available — preserves history.cdc — requires primary keys and Postgres prerequisites.sync_type: "webhook" on
the tables where supports_webhooks: true, then register the webhook as a post-create step (see step 6 below).
Tables that don't support webhooks on the same source still need a bulk sync_type.For each schema that will use incremental/append/cdc, you also need:
incremental_field — which column to track for high-water-mark ordering. Pick from the incremental_fields list
returned by db-schema. Prefer updated_at over created_at (updated_at catches late-arriving updates;
created_at misses them). For integer-only tables, use the monotonically increasing primary key.incremental_field_type — must match the chosen field's type (datetime, timestamp, date, integer,
numeric, objectid).primary_key_columns — required for CDC. Use detected_primary_keys from db-schema.The source's prefix is prepended to table names in HogQL. Tables end up as {prefix}_{table_name}.
stripe, postgres.postgres_prod, postgres_analytics.Confirm the prefix with the user before creating — changing it later is possible but renames every table.
Call external-data-sources-create with:
{
"source_type": "Postgres",
"prefix": "postgres_prod",
"payload": {
"host": "...",
"port": "5432",
"dbname": "...",
"user": "...",
"password": "...",
"schema": "public",
"schemas": [
{
"name": "orders",
"should_sync": true,
"sync_type": "incremental",
"incremental_field": "updated_at",
"incremental_field_type": "datetime",
"primary_key_columns": ["id"]
},
{
"name": "users",
"should_sync": true,
"sync_type": "full_refresh"
},
{
"name": "audit_log",
"should_sync": false
}
]
}
}
Rules for the schemas array:
should_sync: false).
Tables the user didn't mention default to should_sync: false.sync_type is required only when should_sync: true.incremental_field / incremental_field_type must be present when sync_type is incremental or append.primary_key_columns must be present when sync_type is cdc.On success you'll get back a source with a new id. The first sync is triggered automatically.
sync_type: "webhook")Webhook-type schemas don't start receiving data just by existing — the external service needs to know where to POST
events, and PostHog needs to know how to verify them. This is a second call after source creation, not part of the
external-data-sources-create payload. Do this before telling the user the setup is complete, otherwise they
hear "syncs are running" while the push channel is still unregistered.
Only needed when at least one schema on the source has sync_type: "webhook" and should_sync: true. Currently only
Stripe implements this flow; for everything else skip this step.
Before calling create-webhook, check external-data-sources-webhook-info-retrieve({id}). If it already returns
exists: true, do NOT call create-webhook again — each successful call registers a new external endpoint and would
result in duplicate deliveries.
Call external-data-sources-create-webhook-create({id}). PostHog:
signing_secret and stores it securely.Returns {success, webhook_url, error}. On success report the webhook_url to the user for their records — but
they don't need to paste it anywhere; registration is already done.
If success: false with a permissions error like "API key doesn't have permission to create webhooks":
webhook_url.external-data-sources-update-webhook-inputs-create({id}, {inputs: {signing_secret: "whsec_..."}}) to
store it. The HogFunction picks it up and verifies incoming payloads.Verify with external-data-sources-webhook-info-retrieve({id}). A healthy webhook has exists: true,
external_status.status: "enabled", and no error.
Webhooks are supplementary to bulk sync. The first load of a webhook-enabled schema is still done via polling
(initial_sync_complete flips to true when done); after that, the webhook becomes the primary ingestion path. A
webhook schema will still have a sync_frequency that schedules a periodic bulk refresh as a safety net. This is
expected — not something to "fix".
After creation (and, for webhook schemas, after Step 6):
external-data-schemas-list to show the user the initial state.Running, then moves to Completed when the first sync finishes. First
syncs can take anywhere from seconds to hours depending on row count — a multi-million-row table is fine, just
slow.SELECT * FROM {prefix}_{table_name} LIMIT 10 in HogQL.If the user wants near-real-time replication from Postgres:
external-data-sources-check-cdc-prerequisites-create with their Postgres creds.
It returns {valid, errors[]} listing anything missing (wal_level, replication slot, publication, permissions).valid: false, present the errors and ask the user to fix on the Postgres side. Don't try to create a CDC
source that will immediately fail.sync_type: "cdc" on the tables that need it, and
include primary_key_columns for each (CDC requires them).Error and no tables. Skipping the validation step
just pushes the failure into the background.schemas array must correspond to a real table from the db-schema
response. You can't "also add an orders table" unless db-schema found one.data-warehouse-source-connect-link so the user
authenticates in their browser — the connect page renders the source's full connection form (OAuth and credential
options alike) and stores the result without creating the source. Don't collect OAuth tokens or database passwords
in chat; pass the credential_id reference to setup — source creation always happens through setup, not the UI.
(An already-connected OAuth integration can also be passed directly via its id key, e.g.
{"hubspot_integration_id": 123}.)sync_type: "webhook" on a schema doesn't register the
webhook — the create-webhook call does. Always follow create → create-webhook → webhook-info for webhook-type
schemas, and never leave a webhook schema dangling without registration (it just won't receive events).WebhookSource. Don't promise
webhooks for Hubspot, Salesforce, or Postgres — they'll use polling sync.sync_frequency: "24hour") as the default.npx claudepluginhub anthropics/claude-plugins-official --plugin posthogIdentifies when users need to import external data (Stripe, databases, etc.) into PostHog and guides setup via the data warehouse.
Generates a drt sync YAML configuration to connect a data warehouse table to an external service (Slack, REST API, HubSpot, Google Sheets, etc.) or set up a Reverse ETL pipeline.
Migrates Reverse ETL syncs from Census, Hightouch, Polytomic, or custom scripts to drt by mapping sources, destinations, schedules, and generating sync YAML configs.