From systems-design
Guides designing data platforms by comparing warehouses, data lakes, lakehouses, and implementing data mesh patterns with trade-offs.
How this skill is triggered — by the user, by Claude, or both
Slash command
/systems-design:data-architectureThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.
Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics
Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing
Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Sources │ ──► │ ETL │ ──► │ Warehouse │
│ (Structured)│ │ (Transform) │ │ (Star/Snow) │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ BI │
│ Analytics │
└─────────────┘
Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage
Best for:
- Business intelligence
- Financial reporting
- Structured analytics
┌─────────────┐ ┌─────────────┐
│ Sources │ ──► │ Data Lake │
│ (All) │ │ (Raw) │
└─────────────┘ └─────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ ML │ │ ETL │ │ Spark │
│ Training│ │ to DW │ │ Analysis│
└─────────┘ └─────────┘ └─────────┘
Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"
Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
┌─────────────┐ ┌─────────────────────────────────┐
│ Sources │ ──► │ Data Lakehouse │
│ (All) │ │ ┌──────────────────────────┐ │
└─────────────┘ │ │ Metadata Layer │ │
│ │ (Delta/Iceberg/Hudi) │ │
│ └──────────────────────────┘ │
│ ┌──────────────────────────┐ │
│ │ Storage Layer │ │
│ │ (Object Storage) │ │
│ └──────────────────────────┘ │
└─────────────────────────────────┘
│
┌────────────────────┼────────────────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ SQL │ │ ML │ │ Stream │
│ BI │ │ Workload│ │ Process │
└─────────┘ └─────────┘ └─────────┘
Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats
Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
| Factor | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Data types | Structured | All | All |
| Query performance | Excellent | Poor-Medium | Good |
| Data quality | High | Variable | Configurable |
| Cost | High | Low | Medium |
| ML workloads | Limited | Excellent | Excellent |
| Real-time | Limited | Good | Good |
| Governance | Strong | Weak | Strong |
| Complexity | Low | High | Medium |
Decision Tree:
Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
└── Need ML + BI on same data?
├── Yes → Lakehouse
└── No
└── Primarily ML/unstructured?
├── Yes → Data Lake
└── No → Lakehouse
Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)
File format: Parquet + Delta log
Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral
File format: Parquet/ORC/Avro + metadata
Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming
File format: Parquet + Hudi metadata
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| ACID | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Good | Excellent | Good |
| Streaming | Excellent | Good | Excellent |
| Ecosystem | Databricks | Wide | Wide |
| Performance | Excellent | Excellent | Good |
| Community | Large | Growing | Medium |
Data Mesh = Decentralized data architecture
Four Principles:
1. Domain Ownership
- Data owned by domain teams
- Not centralized data team
2. Data as a Product
- Treat data like a product
- Quality, discoverability, usability
3. Self-Serve Platform
- Platform enables domain teams
- Reduces friction
4. Federated Governance
- Global standards
- Local implementation
Data Product = Autonomous unit of data
Components:
┌──────────────────────────────────────┐
│ Data Product │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ Data │ │ Metadata │ │
│ │ (Tables) │ │ (Schema, docs) │ │
│ └──────────┘ └──────────────────┘ │
│ ┌──────────┐ ┌──────────────────┐ │
│ │ Code │ │ APIs │ │
│ │ (ETL) │ │ (Access layer) │ │
│ └──────────┘ └──────────────────┘ │
│ ┌──────────────────────────────────┐│
│ │ Quality + SLAs ││
│ └──────────────────────────────────┘│
└──────────────────────────────────────┘
| Aspect | Centralized | Data Mesh |
|---|---|---|
| Ownership | Central data team | Domain teams |
| Scaling | Team bottleneck | Scales with org |
| Domain knowledge | Lost in translation | Preserved |
| Governance | Centralized | Federated |
| Implementation | Uniform | Heterogeneous |
| Complexity | Lower initially | Higher initially |
┌─────────────┐
│ Dim_Time │
└──────┬──────┘
│
┌───────────┐ │ ┌───────────┐
│Dim_Product├──┼──┤Dim_Customer│
└───────────┘ │ └───────────┘
│
┌──────┴──────┐
│ Fact_Sales │
└─────────────┘
Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
Normalized dimensions:
Dim_Product → Dim_Category → Dim_Subcategory
Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
Hub (business keys) ←→ Link (relationships) ←→ Satellite (attributes)
Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Bronze │ ──► │ Silver │ ──► │ Gold │
│ (Raw) │ │(Cleaned)│ │(Curated)│
└─────────┘ └─────────┘ └─────────┘
Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
Landing Zone: Raw files from sources
Raw Zone: Structured raw data
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation
Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging
Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
etl-elt-patterns - Data transformationstream-processing - Real-time datadatabase-scaling - Database patternsnpx claudepluginhub melodic-software/claude-code-plugins --plugin systems-designOutlines data architecture patterns (warehouse, lake, lakehouse, mesh) with decision trees, schema designs (star/snowflake), medallion zones, Kimball/Inmon, ETL/ELT pipelines, streaming, and scaling strategies.
Provides Snowflake architecture blueprints: traditional data warehouse, Iceberg lakehouse, data mesh with sharing. Includes diagrams and SQL for multi-scale deployments.
Designs data pipeline architectures for batch ETL, streaming, or hybrid scenarios including tech stacks, ASCII diagrams, data quality strategies, and cost analysis. Useful for real-time processing, BI reporting, or migrations.