Guides production Spark Structured Streaming pipelines with Kafka ingestion, stream joins, watermarks, checkpoints, triggers, multi-sink writes, and cost tuning.
How this skill is triggered — by the user, by Claude, or both
Slash command
/databricks-ai-dev-kit:databricks-spark-structured-streamingThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.
from pyspark.sql.functions import col, from_json
# Basic Kafka to Delta streaming
df = (spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "topic")
.load()
.select(from_json(col("value").cast("string"), schema).alias("data"))
.select("data.*")
)
df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/Volumes/catalog/checkpoints/stream") \
.trigger(processingTime="30 seconds") \
.start("/delta/target_table")
| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See stream-stream-joins.md, stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See merge-operations.md |
| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See trigger-and-cost-optimization.md |
| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See streaming-best-practices.md |
npx claudepluginhub databricks-solutions/ai-dev-kit --plugin databricks-ai-dev-kitExpert in Apache Spark: DataFrame transformations, Spark SQL optimization, RDD pipelines, shuffle tuning, partitioning, and structured streaming for big data workloads.
Develops Lakeflow Spark Declarative Pipelines on Databricks for batch and streaming data pipelines using Python or SQL. Guides dataset types like Streaming Tables and features like Auto Loader, Auto CDC via decision tree.
Senior Spark engineer for writing optimized PySpark/SQL jobs, tuning performance (shuffle, partitioning, memory), and building structured streaming pipelines. Activated on big data or Spark debugging tasks.