data-architecture | systems-design

Data Architecture

Modern data architecture patterns including data lakes, lakehouses, data mesh, and data platform design.

When to Use This Skill

Choosing between data lake, warehouse, and lakehouse
Designing a modern data platform
Implementing data mesh principles
Planning data storage strategy
Understanding data architecture trade-offs

Data Architecture Evolution

Generation 1: Data Warehouse (1990s-2000s)
- Structured data only
- ETL into warehouse
- Star/snowflake schemas
- SQL-based analytics

Generation 2: Data Lake (2010s)
- All data types (structured, semi, unstructured)
- Schema-on-read
- Hadoop/HDFS based
- Cheap storage, complex processing

Generation 3: Lakehouse (2020s)
- Best of both: lake flexibility + warehouse features
- ACID transactions on lake
- Schema enforcement optional
- Unified analytics and ML

Architecture Comparison

Data Warehouse

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │     ETL     │ ──► │  Warehouse  │
│ (Structured)│     │ (Transform) │     │ (Star/Snow) │
└─────────────┘     └─────────────┘     └─────────────┘
                                              │
                                              ▼
                                        ┌─────────────┐
                                        │     BI      │
                                        │  Analytics  │
                                        └─────────────┘

Characteristics:
- Schema-on-write
- Optimized for SQL queries
- Structured data only
- High data quality
- Expensive storage

Best for:
- Business intelligence
- Financial reporting
- Structured analytics
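
Schema-on-write can be sketched with Python's built-in sqlite3 module: rows are checked against the declared schema at load time and rejected if they do not fit. The table name, columns, and `load` helper are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is declared up front; constraints are enforced on every insert.
conn.execute("""
    CREATE TABLE fact_sales (
        sale_id   INTEGER PRIMARY KEY,
        sale_date TEXT NOT NULL,
        amount    REAL NOT NULL CHECK (typeof(amount) = 'real' AND amount >= 0)
    )
""")

def load(rows):
    """Insert rows one by one; return the accepted count and the rejected rows."""
    accepted, rejected = 0, []
    for row in rows:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO fact_sales (sale_id, sale_date, amount) VALUES (?, ?, ?)",
                    row,
                )
            accepted += 1
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return accepted, rejected

accepted, rejected = load([
    (1, "2024-01-01", 19.99),   # valid
    (2, None, 5.00),            # rejected: sale_date is NULL
    (3, "2024-01-02", "n/a"),   # rejected: amount is not numeric
])
```

Bad rows never reach the table, which is where the "high data quality" characteristic comes from; the cost is that every source must be mapped to the schema before loading.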

Data Lake

┌─────────────┐     ┌─────────────┐
│   Sources   │ ──► │  Data Lake  │
│    (All)    │     │   (Raw)     │
└─────────────┘     └─────────────┘
                          │
         ┌────────────────┼────────────────┐
         ▼                ▼                ▼
    ┌─────────┐     ┌─────────┐     ┌─────────┐
    │   ML    │     │   ETL   │     │  Spark  │
    │ Training│     │ to DW   │     │ Analysis│
    └─────────┘     └─────────┘     └─────────┘

Characteristics:
- Schema-on-read
- All data types
- Cheap storage
- Flexible processing
- Risk of "data swamp"

Best for:
- Data science/ML
- Unstructured data
- Experimental analysis
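
Schema-on-read defers that validation to query time. A minimal sketch, assuming raw events land as JSON lines; the field names and the `read_purchases` helper are hypothetical.

```python
import json

# Raw data is stored as-is, including malformed and heterogeneous records.
raw_lines = [
    '{"user": "a1", "event": "click", "ts": "2024-01-01T10:00:00"}',
    '{"user": "b2", "event": "purchase", "amount": "19.99"}',
    'not valid json',
]

def read_purchases(lines):
    """Apply a schema at read time: keep purchase events with a numeric amount."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable record: skipped by this reader, still in the lake
        if not isinstance(record, dict) or record.get("event") != "purchase":
            continue
        try:
            yield {"user": str(record["user"]), "amount": float(record["amount"])}
        except (KeyError, TypeError, ValueError):
            continue

purchases = list(read_purchases(raw_lines))
```

Each consumer can apply its own schema to the same raw files, which gives the flexibility listed above. Without shared conventions, it is also how a lake drifts toward a "data swamp".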

Data Lakehouse

┌─────────────┐     ┌─────────────────────────────────┐
│   Sources   │ ──► │         Data Lakehouse          │
│    (All)    │     │  ┌──────────────────────────┐   │
└─────────────┘     │  │    Metadata Layer        │   │
                    │  │ (Delta/Iceberg/Hudi)     │   │
                    │  └──────────────────────────┘   │
                    │  ┌──────────────────────────┐   │
                    │  │    Storage Layer         │   │
                    │  │    (Object Storage)      │   │
                    │  └──────────────────────────┘   │
                    └─────────────────────────────────┘
                                   │
              ┌────────────────────┼────────────────────┐
              ▼                    ▼                    ▼
         ┌─────────┐         ┌─────────┐         ┌─────────┐
         │   SQL   │         │   ML    │         │  Stream │
         │   BI    │         │ Workload│         │ Process │
         └─────────┘         └─────────┘         └─────────┘

Characteristics:
- ACID transactions
- Schema evolution
- Time travel
- Unified batch/streaming
- Open formats

Best for:
- Unified analytics
- Both BI and ML
- Modern data platforms
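
The metadata layer is what turns files in object storage into a table with atomic commits and time travel. The toy model below illustrates the idea only; it is not the actual Delta Lake, Iceberg, or Hudi protocol, and `ToyTable` and its methods are invented for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTable:
    """Immutable data files plus an append-only commit log (toy model)."""
    files: dict = field(default_factory=dict)  # file name -> list of rows
    log: list = field(default_factory=list)    # one snapshot (set of file names) per version

    def commit(self, add=None, remove=()):
        """Publish a new version in one step: readers see all changes or none."""
        add = add or {}
        current = set(self.log[-1]) if self.log else set()
        missing = set(remove) - current
        if missing:
            raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
        self.files.update(add)  # write new immutable files first
        self.log.append(frozenset((current - set(remove)) | set(add)))  # then publish
        return len(self.log) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an older one (time travel)."""
        if not self.log:
            return []
        snapshot = self.log[-1 if version is None else version]
        return sorted(row for name in snapshot for row in self.files[name])

table = ToyTable()
v0 = table.commit(add={"part-0": [("a", 1), ("b", 2)]})
# An update rewrites the affected file and swaps it in a single commit.
v1 = table.commit(add={"part-1": [("a", 10), ("b", 2)]}, remove=["part-0"])

latest = table.read()
as_of_v0 = table.read(version=v0)
```

Because old files are never modified in place, reading an earlier version only requires looking up an earlier log entry.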

Architecture Selection Guide

Factor	Warehouse	Lake	Lakehouse
Data types	Structured	All	All
Query performance	Excellent	Poor-Medium	Good
Data quality	High	Variable	Configurable
Cost	High	Low	Medium
ML workloads	Limited	Excellent	Excellent
Real-time	Limited	Good	Good
Governance	Strong	Weak	Strong
Complexity	Low	High	Medium

Decision Tree:

Is data mostly structured with BI focus?
├── Yes → Data Warehouse
└── No
    └── Need ML + BI on same data?
        ├── Yes → Lakehouse
        └── No
            └── Primarily ML/unstructured?
                ├── Yes → Data Lake
                └── No → Lakehouse
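
The decision tree above can be written as a small function. The parameter names are made up for this sketch; the branches follow the tree exactly.

```python
def choose_architecture(
    mostly_structured_bi: bool,
    needs_ml_and_bi: bool,
    primarily_ml_unstructured: bool,
) -> str:
    """Walk the decision tree top to bottom and return the suggested architecture."""
    if mostly_structured_bi:
        return "Data Warehouse"
    if needs_ml_and_bi:
        return "Lakehouse"
    if primarily_ml_unstructured:
        return "Data Lake"
    return "Lakehouse"

suggestion = choose_architecture(
    mostly_structured_bi=False,
    needs_ml_and_bi=False,
    primarily_ml_unstructured=True,
)
```

Treat the result as a starting point; the factors in the selection table (cost, governance, complexity) still need to be weighed for a real platform.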

Lakehouse Technologies

Delta Lake (created at Databricks, now a Linux Foundation project)

Features:
- ACID transactions
- Time travel (data versioning)
- Schema enforcement/evolution
- Unified batch/streaming
- Optimized performance (Z-ordering, compaction)

File format: Parquet + Delta log

Apache Iceberg (originated at Netflix)

Features:
- ACID transactions
- Hidden partitioning
- Schema evolution
- Time travel
- Vendor neutral

File format: Parquet/ORC/Avro + metadata
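
Hidden partitioning means the partition value is derived from a column by a transform, so writers and readers work with the column itself and never with a separate partition column. The sketch below illustrates the concept in plain Python; it is not the Iceberg API, and `ToyPartitionedTable` and `day_transform` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime

def day_transform(ts: datetime) -> str:
    """Derive the partition key from the timestamp column."""
    return ts.strftime("%Y-%m-%d")

class ToyPartitionedTable:
    def __init__(self, transform):
        self.transform = transform
        self.partitions = defaultdict(list)  # partition key -> rows

    def append(self, row):
        # The writer supplies only `ts`; the table computes the partition.
        self.partitions[self.transform(row["ts"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Filter on `ts` only; partitions outside the range are never read."""
        lo, hi = self.transform(start), self.transform(end)
        scanned = [key for key in self.partitions if lo <= key <= hi]
        rows = [
            row
            for key in scanned
            for row in self.partitions[key]
            if start <= row["ts"] <= end
        ]
        return rows, scanned

table = ToyPartitionedTable(day_transform)
for ts in (
    datetime(2024, 1, 1, 9),
    datetime(2024, 1, 1, 23),
    datetime(2024, 1, 2, 8),
    datetime(2024, 1, 5, 12),
):
    table.append({"ts": ts})

rows, scanned = table.scan(datetime(2024, 1, 1, 12), datetime(2024, 1, 2, 12))
```

The query mentions only `ts`, yet the 2024-01-05 partition is skipped entirely.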

Apache Hudi (originated at Uber)

Features:
- ACID transactions
- Incremental processing
- Record-level updates
- Time travel
- Optimized for streaming

File format: Parquet + Hudi metadata
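
Record-level updates and incremental processing can be illustrated with a toy upsert keyed by record id. This is a conceptual sketch, not Hudi's API; the `upsert` and `incremental_read` functions and the `updated_at` and `_commit_time` fields are assumptions made for the example.

```python
def upsert(table: dict, batch: list, commit_time: int) -> dict:
    """Merge a batch into the table by record key; the newest `updated_at` wins."""
    for row in batch:
        existing = table.get(row["id"])
        if existing is None or row["updated_at"] >= existing["updated_at"]:
            table[row["id"]] = {**row, "_commit_time": commit_time}
    return table

def incremental_read(table: dict, since_commit: int) -> list:
    """Return only the records written after the given commit."""
    changed = (r for r in table.values() if r["_commit_time"] > since_commit)
    return sorted(changed, key=lambda r: r["id"])

table = {}
upsert(table, [
    {"id": 1, "updated_at": 1, "value": "a"},
    {"id": 2, "updated_at": 1, "value": "b"},
], commit_time=1)
upsert(table, [
    {"id": 1, "updated_at": 2, "value": "a2"},     # newer: replaces record 1
    {"id": 2, "updated_at": 0, "value": "stale"},  # older: ignored
], commit_time=2)

changes = incremental_read(table, since_commit=1)
```

Downstream jobs that consume only `changes` avoid rescanning the full table, which is the appeal of incremental processing for streaming-style workloads.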

Technology Comparison

Feature	Delta Lake	Iceberg	Hudi
ACID	Yes	Yes	Yes
Time Travel	Yes	Yes	Yes
Schema Evolution	Good	Excellent	Good
Streaming	Excellent	Good	Excellent
Ecosystem	Databricks-centric	Wide	Wide
Performance	Excellent	Excellent	Good
Community	Large	Growing	Medium

Data Mesh

Principles

Data Mesh = Decentralized data architecture

Four Principles:

1. Domain Ownership
   - Data owned by domain teams
   - Not by a centralized data team

2. Data as a Product
   - Treat data like a product
   - Quality, discoverability, usability

3. Self-Serve Platform
   - Platform enables domain teams
   - Reduces friction

4. Federated Governance
   - Global standards
   - Local implementation

Data Products

Data Product = Autonomous unit of data

Components:
┌──────────────────────────────────────┐
│           Data Product               │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Data   │  │     Metadata     │ │
│  │ (Tables) │  │ (Schema, docs)   │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────┐  ┌──────────────────┐ │
│  │   Code   │  │      APIs        │ │
│  │ (ETL)    │  │  (Access layer)  │ │
│  └──────────┘  └──────────────────┘ │
│  ┌──────────────────────────────────┐│
│  │         Quality + SLAs           ││
│  └──────────────────────────────────┘│
└──────────────────────────────────────┘
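
One way to make these components concrete is a descriptor that a domain team fills in before publishing. The sketch below is hypothetical: the `DataProduct` class, its fields, and the example values are illustrative and do not follow any particular catalog's format.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DataProduct:
    name: str
    owner_domain: str          # domain team accountable for the product
    schema: dict               # metadata: column name -> type
    description: str           # metadata: human-readable documentation
    pipeline: str              # code: reference to the job that builds the data
    access_endpoint: str       # API / access layer consumers use
    freshness_sla_hours: int   # SLA: maximum acceptable data age
    min_completeness: float    # quality threshold between 0 and 1

    def publishing_problems(self) -> list:
        """Return the reasons this product is not ready to publish (empty if none)."""
        problems = []
        if not self.owner_domain:
            problems.append("no owning domain")
        if not self.schema:
            problems.append("no schema")
        if not self.description:
            problems.append("no documentation")
        if self.freshness_sla_hours <= 0:
            problems.append("freshness SLA must be positive")
        if not 0.0 <= self.min_completeness <= 1.0:
            problems.append("completeness must be between 0 and 1")
        return problems

orders = DataProduct(
    name="orders_daily",
    owner_domain="sales",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    description="One row per order, refreshed daily.",
    pipeline="jobs/orders_daily.py",     # hypothetical path
    access_endpoint="sales.orders_daily",  # hypothetical table identifier
    freshness_sla_hours=24,
    min_completeness=0.99,
)
undocumented = replace(orders, description="")
```

A check like `publishing_problems` is one place where federated governance can be automated: the standards are global, while each domain supplies its own values.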

Data Mesh vs Centralized

Aspect	Centralized	Data Mesh
Ownership	Central data team	Domain teams
Scaling	Team bottleneck	Scales with org
Domain knowledge	Lost in translation	Preserved
Governance	Centralized	Federated
Implementation	Uniform	Heterogeneous
Complexity	Lower initially	Higher initially

Data Modeling Patterns

Star Schema

        ┌─────────────┐
        │  Dim_Time   │
        └──────┬──────┘
               │
┌───────────┐  │  ┌───────────┐
│Dim_Product├──┼──┤Dim_Customer│
└───────────┘  │  └───────────┘
               │
        ┌──────┴──────┐
        │ Fact_Sales  │
        └─────────────┘

Pros: Simple, fast queries
Cons: Denormalized, redundancy
Best for: BI, reporting
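
A minimal star schema, sketched with Python's built-in sqlite3 module. Table and column names mirror the diagram; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_time     (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- The fact table holds measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_time(date_id),
        amount      REAL
    );
    INSERT INTO dim_product  VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
    INSERT INTO dim_time     VALUES (20240101, 2024, 1);
    INSERT INTO fact_sales   VALUES
        (1, 1, 20240101, 1000.0),
        (1, 2, 20240101, 1200.0),
        (2, 1, 20240101, 300.0);
""")

# Every dimension is exactly one join away from the fact table.
sales_by_category_region = conn.execute("""
    SELECT p.category, c.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY p.category, c.region
    ORDER BY p.category, c.region
""").fetchall()
```

Note that `category` is stored directly on `dim_product`; that repetition is the redundancy listed under the cons.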

Snowflake Schema

Normalized dimensions:
Dim_Product → Dim_Subcategory → Dim_Category

Pros: Less redundancy
Cons: More joins, slower
Best for: Complex hierarchies
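
The same product dimension in snowflake form, again sketched with sqlite3. It assumes the usual hierarchy in which a product belongs to a subcategory and a subcategory belongs to a category; the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE dim_subcategory (
        subcategory_id INTEGER PRIMARY KEY,
        name           TEXT,
        category_id    INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE dim_product (
        product_id     INTEGER PRIMARY KEY,
        name           TEXT,
        subcategory_id INTEGER REFERENCES dim_subcategory(subcategory_id)
    );
    INSERT INTO dim_category    VALUES (1, 'Electronics');
    INSERT INTO dim_subcategory VALUES (10, 'Computers', 1);
    INSERT INTO dim_product     VALUES (100, 'Laptop', 10);
""")

# Resolving a product's category now takes two joins instead of none.
product_hierarchy = conn.execute("""
    SELECT p.name, s.name, c.name
    FROM dim_product p
    JOIN dim_subcategory s ON s.subcategory_id = p.subcategory_id
    JOIN dim_category    c ON c.category_id    = s.category_id
""").fetchall()
```

Each category name is stored once, which removes redundancy at the price of the extra joins.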

Data Vault

Hub (business keys) ←→ Link (relationships between hubs)
Satellite (descriptive attributes and their history) attaches to a Hub or a Link

Pros: Auditable, flexible, scalable
Cons: Complex, learning curve
Best for: Enterprise data warehouse
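
A small in-memory sketch of loading hubs, a link, and a satellite. It assumes hash keys derived from business keys, a common Data Vault convention; the structures, the `load_order` function, and the key normalization are illustrative only. Here the satellite hangs off the customer hub.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Deterministic key from one or more normalized business keys."""
    normalized = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

hub_customer = {}         # hash key -> customer business key
hub_order = {}            # hash key -> order business key
link_customer_order = {}  # link hash key -> (customer hash key, order hash key)
sat_customer = []         # append-only history of customer attributes

def load_order(customer_id: str, order_id: str, customer_attrs: dict, load_ts: str):
    c_hk, o_hk = hash_key(customer_id), hash_key(order_id)
    # Hubs and links are insert-only: existing keys are left untouched.
    hub_customer.setdefault(c_hk, customer_id)
    hub_order.setdefault(o_hk, order_id)
    link_customer_order.setdefault(hash_key(customer_id, order_id), (c_hk, o_hk))
    # Satellites get a new row only when the attributes actually change.
    latest = next((s for s in reversed(sat_customer) if s["hub_hk"] == c_hk), None)
    if latest is None or latest["attrs"] != customer_attrs:
        sat_customer.append(
            {"hub_hk": c_hk, "load_ts": load_ts, "attrs": dict(customer_attrs)}
        )

load_order("C-1", "O-100", {"city": "Berlin"}, "2024-01-01")
load_order("C-1", "O-101", {"city": "Munich"}, "2024-02-01")
```

Nothing is ever updated in place, so the full attribute history stays available; that is the basis of the auditability listed above.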

Storage Layers

Bronze/Silver/Gold (Medallion Architecture)

┌─────────┐     ┌─────────┐     ┌─────────┐
│ Bronze  │ ──► │ Silver  │ ──► │  Gold   │
│  (Raw)  │     │(Cleaned)│     │(Curated)│
└─────────┘     └─────────┘     └─────────┘

Bronze: Raw ingestion, append-only
Silver: Cleaned, validated, conformed
Gold: Business-level aggregates, features
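
The three layers can be sketched as two plain functions over in-memory rows. The field names and cleaning rules are assumptions for illustration; a real pipeline would run on a lakehouse table format rather than Python lists.

```python
# Bronze: raw records exactly as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "amount": "10.50", "country": "us"},
    {"order_id": "2", "amount": "bad", "country": "DE"},
    {"order_id": "1", "amount": "10.50", "country": "us"},  # duplicate
    {"order_id": "3", "amount": "4.50", "country": "de"},
]

def to_silver(rows):
    """Clean, validate, and conform: typed columns, no duplicates, uniform codes."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # invalid amount: excluded from silver, still kept in bronze
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return silver

def to_gold(rows):
    """Business-level aggregate: revenue per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because bronze is append-only and never cleaned in place, silver and gold can be rebuilt from it whenever the rules change.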

Zones in Data Lake

Landing Zone: Raw files from sources
Raw Zone: Unmodified source data stored in a consistent, queryable layout
Curated Zone: Transformed, quality-checked
Consumption Zone: Ready for analytics
Sandbox Zone: Exploration and experimentation

Best Practices

Data Quality

Implement quality gates:
- Schema validation
- Null checks
- Range validation
- Referential integrity
- Freshness monitoring
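
The gates listed above can be sketched as a single validation pass. The column names, thresholds, and the `run_quality_gates` function are hypothetical; dedicated data quality tools offer much richer checks.

```python
from datetime import datetime, timedelta

REQUIRED = {"order_id": int, "customer_id": int, "amount": float, "loaded_at": datetime}

def run_quality_gates(rows, known_customer_ids, now, max_age=timedelta(hours=24)):
    """Return a list of (row index or None, failure description)."""
    failures = []
    for i, row in enumerate(rows):
        # Schema validation and null checks: every column present with the right type.
        bad = [col for col, typ in REQUIRED.items() if not isinstance(row.get(col), typ)]
        if bad:
            failures.append((i, f"schema/null: {bad}"))
            continue
        # Range validation.
        if not 0 <= row["amount"] <= 1_000_000:
            failures.append((i, "range: amount"))
        # Referential integrity against the customer dimension.
        if row["customer_id"] not in known_customer_ids:
            failures.append((i, "referential integrity: customer_id"))
    # Freshness: the newest record must be recent enough.
    timestamps = [r["loaded_at"] for r in rows if isinstance(r.get("loaded_at"), datetime)]
    if not timestamps or now - max(timestamps) > max_age:
        failures.append((None, "freshness"))
    return failures

now = datetime(2024, 1, 2, 12, 0)
loaded = now - timedelta(hours=1)
failures = run_quality_gates(
    [
        {"order_id": 1, "customer_id": 1, "amount": 10.0, "loaded_at": loaded},
        {"order_id": 2, "customer_id": 1, "amount": None, "loaded_at": loaded},
        {"order_id": 3, "customer_id": 99, "amount": -5.0, "loaded_at": loaded},
    ],
    known_customer_ids={1, 2},
    now=now,
)
```

A pipeline would typically block promotion to the next layer, or quarantine the failing rows, when `failures` is not empty.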

Governance

Key capabilities:
- Data catalog
- Lineage tracking
- Access control
- Privacy compliance
- Audit logging

Performance

Optimization techniques:
- Partitioning (by date, region)
- Clustering/Z-ordering
- Compaction
- Caching
- Materialized views
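
Compaction addresses the small-files problem by rewriting many small files into fewer large ones. The planner below is a simplified greedy sketch; the 128 MB target and the `plan_compaction` function are assumptions, and real table formats ship their own compaction routines.

```python
def plan_compaction(file_sizes_mb: dict, target_mb: int = 128) -> list:
    """Group small files into batches of at most target_mb to rewrite together."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items(), key=lambda item: item[1]):
        if size >= target_mb:
            continue  # already large enough, leave untouched
        if current and current_size + size > target_mb:
            if len(current) > 1:  # rewriting a single file gains nothing
                groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        groups.append(current)
    return groups

plan = plan_compaction({"a": 10, "b": 20, "c": 30, "d": 100, "e": 200})
```

Here files `a`, `b`, and `c` are merged into one 60 MB file, while `d` and `e` are left alone.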

Related Skills

etl-elt-patterns - Data transformation
stream-processing - Real-time data
database-scaling - Database patterns