Skill

Qspark-engineer

Optimizes Apache Spark jobs for distributed data processing: DataFrame transformations, Spark SQL, RDD pipelines, shuffle tuning, executor memory, partitioning, and structured streaming.

Python

SQL

data-engineering

performance

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/qe-framework:Qspark-engineer

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.

Supporting Files

references/partitioning-caching.md
references/performance-tuning.md
references/rdd-operations.md
references/spark-sql-dataframes.md
references/streaming-patterns.md

SKILL.md

121 lines · ~1.2k tokens

Stats

Language: JavaScript

Stars: 5

Maintenance: Excellent

Last Commit: Jun 15, 2026

Spark Engineer

Senior Apache Spark engineer specializing in high-performance distributed data processing, optimizing large-scale ETL pipelines, and building production-grade Spark applications.

Core Workflow

Analyze — Understand data volume, transformations, latency, cluster resources
Design — Choose DataFrame vs RDD, plan partitioning, identify broadcast opportunities
Implement — Write Spark code with optimized transforms, caching, error handling
Optimize — Analyze Spark UI; tune shuffle partitions, eliminate skew, optimize joins
Validate — Check Spark UI for spill; verify partitions with df.rdd.getNumPartitions()
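The Optimize and Validate steps mention skew and partition checks without saying how to quantify them. One possible sketch, under assumptions: the helper names `partition_skew_ratio` and `count_rows_per_partition` are hypothetical, and "max over median" is only one of several reasonable skew measures. The Spark-facing function imports `pyspark` lazily so the pure-Python part can be used on its own.

```python
from statistics import median


def partition_skew_ratio(rows_per_partition: list[int]) -> float:
    """Return the largest partition size divided by the median partition size.

    A ratio near 1.0 means evenly sized partitions; a large ratio suggests skew.
    Empty partitions are ignored.
    """
    non_empty = [n for n in rows_per_partition if n > 0]
    if not non_empty:
        return 1.0  # Nothing to compare; treat an empty dataset as balanced
    return max(non_empty) / median(non_empty)


def count_rows_per_partition(df) -> list[int]:
    """Count rows in each non-empty partition of a Spark DataFrame.

    Triggers a Spark job. Hypothetical helper, not part of the skill itself.
    """
    from pyspark.sql.functions import spark_partition_id

    counts = df.groupBy(spark_partition_id().alias("pid")).count().collect()
    return [row["count"] for row in counts]
```

Something like `partition_skew_ratio(count_rows_per_partition(df))` could then be logged next to `df.rdd.getNumPartitions()`. What ratio counts as "too skewed" depends on the workload, so no threshold is suggested here.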

Code Patterns (3 Examples with Docstrings)

# Pattern 1: Schema-driven DataFrame creation
def create_typed_dataframe(spark, data: list, schema_dict: dict):
    """Create DataFrame with explicit schema and type safety."""
    from pyspark.sql.types import StructType, StructField
    schema = StructType([StructField(k, v, True) for k, v in schema_dict.items()])
    return spark.createDataFrame(data, schema=schema)

# Pattern 2: Broadcast dimension join
def broadcast_dimension_join(large_df, dim_df, join_key: str):
    """Join large fact table with small dimension (<200MB) using broadcast."""
    from pyspark.sql.functions import broadcast
    return large_df.join(broadcast(dim_df), on=join_key, how="left")

# Pattern 3: Safe caching with validation
def cache_with_validation(df, operation_name: str):
    """Cache DataFrame and materialize immediately to detect spill."""
    cached = df.cache()
    row_count = cached.count()  # Materialize now, not later
    print(f"{operation_name}: cached {row_count} rows")
    return cached

Comment Template (Google-style)

def transform_spark_data(df, threshold: float):
    """One-line transformation summary.
    
    Longer: explain Spark strategy, partition assumptions, performance implications.
    
    Args:
        df: Input Spark DataFrame
        threshold: Filtering threshold
    
    Returns:
        Transformed Spark DataFrame
    
    Raises:
        ValueError: If threshold < 0
    """
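To show the template filled in, here is a minimal sketch of a function that matches its documented contract, including the `ValueError`. The function name `filter_above_threshold` and the `column` parameter with its default `"amount"` are assumptions made for illustration, not part of the original template.

```python
def filter_above_threshold(df, threshold: float, column: str = "amount"):
    """Keep rows whose `column` value is at least `threshold`.

    Uses a narrow filter, so no shuffle is triggered and the input
    partitioning is preserved.

    Args:
        df: Input Spark DataFrame containing `column`
        threshold: Non-negative lower bound (inclusive)
        column: Name of the numeric column to test (hypothetical default)

    Returns:
        Filtered Spark DataFrame

    Raises:
        ValueError: If threshold < 0
    """
    # Validate before touching Spark so bad input fails fast on the driver
    if threshold < 0:
        raise ValueError(f"threshold must be >= 0, got {threshold}")

    from pyspark.sql import functions as F  # Lazy import, as in the patterns above

    return df.filter(F.col(column) >= threshold)
```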

Lint Rules (ruff/mypy/black)

[tool.ruff]
line-length = 120

[tool.ruff.lint]
select = ["E", "F", "W", "UP"]
ignore = ["E501"]

[tool.mypy]
python_version = "3.10"
disallow_untyped_defs = true
ignore_missing_imports = true

Security Checklist (5+)

Credential exposure — Use Kubernetes secrets, IAM roles; never hardcode passwords
Data at rest encryption — Enable Parquet/Delta encryption; verify storage-level encryption
Access control — Enforce table/database ACLs; implement row-level security (RLS)
UDF injection — Never execute user-supplied code; use pandas_udf with schema validation
Broadcast secrets — Don't broadcast PII; validate payload < 2GB; no credentials in broadcast
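For the credential-exposure item, one way to keep passwords out of source code is to read them from environment variables, which a Kubernetes secret can populate. The sketch below is an assumption-laden illustration: the variable names `JDBC_USER` and `JDBC_PASSWORD` and the helper `jdbc_options_from_env` are hypothetical. The returned keys (`url`, `dbtable`, `user`, `password`) are standard Spark JDBC data source options.

```python
import os


def jdbc_options_from_env(url: str, table: str) -> dict[str, str]:
    """Build JDBC reader options with credentials taken from the environment."""
    required = ("JDBC_USER", "JDBC_PASSWORD")  # Hypothetical variable names
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        # Report which variables are absent, never their values
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "url": url,
        "dbtable": table,
        "user": os.environ["JDBC_USER"],
        "password": os.environ["JDBC_PASSWORD"],
    }
```

A reader could then be built with `spark.read.format("jdbc").options(**jdbc_options_from_env(url, table)).load()`. Avoid printing or logging the returned dictionary, since it contains the password.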

Anti-patterns (5 Wrong/Correct)

Anti-pattern	Fix
`df.collect()` on large DataFrame	Use `.limit()`, write to storage, or `.sample(0.1)`
No partitioning in ETL	Set `spark.sql.shuffle.partitions = 400` or `repartition(400)`
Python UDFs without vectorization	Prefer built-in `pyspark.sql.functions` column expressions; use `pandas_udf` when custom Python logic is unavoidable
Caching every intermediate DataFrame	Cache only reused DataFrames; use `.unpersist()`
Ignoring Spark UI shuffle metrics	Check UI; if shuffle spill > 10%, adjust partitions/joins
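The caching row pairs `.cache()` with `.unpersist()`, and the release step is easy to forget when a job fails midway. A small context manager is one possible way to tie the two together. The name `cached` is hypothetical, and the sketch relies only on the DataFrame `cache()` and `unpersist()` methods.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache `df` for the duration of a with-block, then release it."""
    cached_df = df.cache()
    try:
        yield cached_df
    finally:
        cached_df.unpersist()  # Runs even if the body raises


# Example usage (assumes `orders` is an existing Spark DataFrame):
#
# with cached(orders) as o:
#     daily = o.groupBy("day").count()
#     daily.write.parquet("/tmp/daily")   # actions must run inside the block
```

Because Spark evaluates lazily, any action that should benefit from the cache has to execute inside the `with` block. A DataFrame returned from the block and evaluated later would be recomputed from source.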

Quick Config

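from pyspark.sql import SparkSession

# spark.memory.fraction defaults to 0.6; 0.8 leaves less heap for user data structures
spark = SparkSession.builder \
    .config("spark.sql.shuffle.partitions", "400") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.memory.fraction", "0.8") \
    .getOrCreate()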
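The value 400 for `spark.sql.shuffle.partitions` is a starting point, not a constant. A rough way to derive a value from the expected shuffle volume is sketched below. The 128 MiB target partition size is a common rule of thumb, assumed here for illustration and not taken from this skill, and the helper name `suggest_shuffle_partitions` is hypothetical.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    total_cores: int,
    target_partition_bytes: int = 128 * 1024 * 1024,  # Assumed rule of thumb
) -> int:
    """Estimate a shuffle partition count from data size and cluster cores."""
    if shuffle_bytes < 0 or total_cores <= 0 or target_partition_bytes <= 0:
        raise ValueError("shuffle_bytes must be >= 0; cores and target must be > 0")
    # Number of partitions needed to keep each one near the target size
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to a multiple of the core count so the last wave of tasks is full
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores
```

For example, 50 GiB of shuffle data on 100 cores gives 400 under these assumptions. With adaptive query execution enabled, Spark may coalesce small shuffle partitions at runtime, so the estimate serves best as an upper starting value to check against the Spark UI.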

MUST DO / MUST NOT DO

MUST: Define schemas, partition data (the Spark tuning guide suggests roughly 2-3 tasks per CPU core), broadcast small dims, monitor Spark UI, test at production scale
MUST NOT: Collect large data, skip schema definition, cache everything, ignore shuffle, use plain UDFs on big data
