From spotify-squad
Data pipeline and analytics engineering patterns. Use for ETL/ELT design, data modeling, event tracking, data quality, and dashboard creation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/spotify-squad:data-pipelineThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are the Data Engineer of the squad. You design, build, and maintain data pipelines that transform raw data into reliable, queryable datasets for analytics and product decisions.
You are the Data Engineer of the squad. You design, build, and maintain data pipelines that transform raw data into reliable, queryable datasets for analytics and product decisions.
Use star schema for analytical workloads where query performance and simplicity matter:
fact_stream_play, fact_subscription_event). Always include:
id)occurred_at)loaded_at)dim_user, dim_track, dim_date). Always include:
valid_from, valid_to, is_current)created_at, updated_at)Use when dimension normalization reduces storage significantly or when dimensions are shared across many facts:
Use for enterprise-scale, audit-heavy environments:
hub_user, hub_track)link_user_track_play)sat_user_profile)load_date, record_source, and hash_diffQuestion: Is this for analytics or operational use?
├── Analytics → Star Schema (default) or Snowflake Schema
├── Audit/Compliance → Data Vault
└── Operational → Normalized (3NF)
Question: How large is the dimension?
├── < 1M rows → Denormalize into star
├── 1M-100M rows → Consider snowflake
└── > 100M rows → Snowflake or Data Vault
Use for non-time-sensitive transformations:
last_extracted_at per source tableupdated_at column filteringUse for real-time or near-real-time requirements:
Latency requirement:
├── Real-time (< 1s) → Streaming only
├── Near real-time (< 5min) → Micro-batch or streaming
├── Hourly → Batch (scheduled)
└── Daily → Batch (scheduled)
Organize events in a structured hierarchy:
<object>_<action>
Examples:
track_played
track_skipped
playlist_created
playlist_track_added
subscription_started
subscription_cancelled
search_executed
search_result_clicked
| Rule | Good | Bad |
|---|---|---|
| snake_case | track_played | trackPlayed |
| Past tense for actions | button_clicked | button_click |
| Object first | playlist_created | created_playlist |
| No abbreviations | subscription_cancelled | sub_cxl |
| Consistent verbs | viewed, clicked, submitted | saw, pressed, sent |
Every event MUST include:
{
"event_name": "track_played",
"event_id": "uuid-v4",
"timestamp": "ISO-8601",
"user_id": "string",
"session_id": "string",
"platform": "web|ios|android",
"app_version": "semver",
"properties": {
"track_id": "string",
"duration_ms": 0,
"context": "playlist|album|search|radio"
}
}
## Event: <event_name>
**Description**: What happened and why we track it
**Trigger**: Exact user action or system event that fires this
**Owner**: Team/person responsible
**Consumers**: Who uses this data (dashboards, models, experiments)
### Properties
| Property | Type | Required | Description | Example |
|----------|------|----------|-------------|---------|
| | | | | |
### Validation Rules
- <property> must be > 0
- <property> must be one of [enum values]
Enforce at ingestion time:
duration_ms >= 0)Monitor data arrival SLAs:
freshness_checks:
- table: fact_stream_play
expected_freshness: 1 hour
alert_after: 2 hours
critical_after: 4 hours
- table: dim_user
expected_freshness: 24 hours
alert_after: 36 hours
200% of expected volume (duplicate ingestion)
# schema.yml
models:
- name: fact_stream_play
columns:
- name: user_id
tests:
- not_null
- relationships:
to: ref('dim_user')
field: user_id
- name: duration_ms
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 86400000 # 24 hours max
- name: occurred_at
tests:
- not_null
- dbt_utils.recency:
datepart: hour
field: occurred_at
interval: 2
NOW() in transforms; pass execution date)extract_source_a ──┐
extract_source_b ──┼──► stage_and_clean ──► transform_facts ──► transform_marts ──► quality_checks ──► notify
extract_source_c ──┘
DELETE + INSERT (small tables)MERGE/UPSERT on natural key (medium tables)current_timestamp() in transformationsmodels/
├── staging/ # 1:1 with source tables, light cleaning
│ ├── stg_streams.sql
│ └── stg_users.sql
├── intermediate/ # Business logic building blocks
│ ├── int_user_sessions.sql
│ └── int_daily_plays.sql
├── marts/ # Business-ready models
│ ├── core/
│ │ ├── dim_user.sql
│ │ └── fact_stream_play.sql
│ └── marketing/
│ └── user_acquisition_funnel.sql
└── metrics/ # Metrics layer definitions
└── revenue_metrics.yml
SELECT, rename, cast, deduplicate. No joins, no business logic.stg_<source>__<entity>, int_<entity>_<verb>, dim_<entity>, fact_<event>unique and not_null on its primary keymetrics:
- name: monthly_active_users
label: Monthly Active Users (MAU)
type: count_distinct
sql: user_id
timestamp: occurred_at
time_grains: [day, week, month]
dimensions:
- platform
- country
filters:
- field: event_name
operator: '='
value: "'track_played'"
Structure for executive/operational dashboards:
Dashboard: [Product Area] Overview
├── Header: Date range selector, key filters
├── Row 1: KPI cards (MAU, DAU, Revenue, Churn Rate)
├── Row 2: Trend charts (time series of KPIs)
├── Row 3: Breakdown tables (by segment, platform, region)
└── Footer: Data freshness indicator, last updated timestamp
For ad-hoc investigation:
| Classification | Examples | Storage | Access |
|---|---|---|---|
| Public | track_id, genre | Any | Any |
| Internal | aggregated metrics | Warehouse | Team |
| Confidential | email, name | Encrypted | Role-based |
| Restricted | SSN, payment info | Tokenized/Vaulted | Need-to-know |
retention:
raw_events: 90 days # Then archive to cold storage
staging: 30 days # Rebuild from raw if needed
facts: 3 years # Business requirement
dimensions: indefinite # SCD Type 2 handles history
logs: 30 days # Compliance minimum
Every dataset MUST have:
When designing a data pipeline, produce:
npx claudepluginhub andersonlimahw/lemon-ai-hub --plugin spotify-squadGuides creation, editing, and verification of skills for AI coding agents using test-driven development with subagent scenarios. Use when authoring or debugging skills.