Skill

gcp-spark

From dak

Develops and executes Spark ETL pipelines on GCP Dataproc clusters and serverless. Reads/writes BigLake Iceberg, BigQuery, Spanner. Debugs failures and manages jobs.

GCP

Python

data-engineering

Invocation

Slash command: /dak:gcp-spark

Supporting Files

references/gcloud_dataproc.md
references/ml_tasks.md
references/read_write_data.md
references/schema_direct_inspection.md
references/spark_optimizations.md

SKILL.md

93 lines · ~973 tokens

Stats

Language: TypeScript
Stars: 110
Forks: 17
Maintenance: Excellent
Last commit: Jun 17, 2026

Spark on Dataproc

> [!IMPORTANT]
> You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.

Task Execution Workflow

1. Understand schemas: ALWAYS use the @skill:discovering-gcp-data-assets skill or references/schema_direct_inspection.md to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
2. Generate Spark code:
   - Output format: ALWAYS generate code in Python notebook (.ipynb) format. Generate scripts (.py) only if explicitly requested.
   - Read and write data: ALWAYS refer to references/read_write_data.md when reading or writing data.
   - ML tasks: Refer to the @skill:ml-best-practices skill and references/ml_tasks.md when generating ML code.
   - Spark optimizations: ALWAYS refer to references/spark_optimizations.md when generating Spark code, and apply optimizations wherever applicable.
3. Verify schema before write: ALWAYS verify that the DataFrame schema matches the destination schema. Use df.printSchema() for the DataFrame schema, and refer to the @skill:discovering-gcp-data-assets skill or references/schema_direct_inspection.md for the destination schema.
4. Compile code before executing: For notebooks, first convert them to a Python script with jupyter nbconvert --to script your-notebook.ipynb, then compile the result with python3 -m py_compile your-notebook.py.
5. Execute script: ONLY when generating a .py script, refer to references/gcloud_dataproc.md for the command that runs the generated code on Dataproc. This does NOT apply when generating notebooks.
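
The compile step in the workflow above can be scripted. The following is a minimal sketch, assuming jupyter is on the PATH and the notebook uses a Python kernel (so nbconvert writes a .py file next to the notebook). The function names notebook_to_script and compiles_cleanly are illustrative and not part of the skill.

```python
import py_compile
import subprocess
from pathlib import Path


def notebook_to_script(notebook_path: str) -> Path:
    """Convert a notebook to a script with `jupyter nbconvert --to script`."""
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "script", notebook_path],
        check=True,  # raise CalledProcessError if the conversion fails
    )
    # Assumption: a Python-kernel notebook yields <name>.py beside the notebook.
    return Path(notebook_path).with_suffix(".py")


def compiles_cleanly(script_path: str) -> bool:
    """Return True if the script byte-compiles, like `python3 -m py_compile`."""
    try:
        py_compile.compile(script_path, doraise=True)
        return True
    except py_compile.PyCompileError as err:
        print(f"Compile failed: {err.msg}")
        return False
```

A call such as compiles_cleanly(str(notebook_to_script("your-notebook.ipynb"))) chains both steps. Byte-compilation only catches syntax errors; it will not detect a wrong column name or a schema mismatch, which is why the schema checks remain separate steps.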

Common Mistakes Checklist

> [!CAUTION]
> Verify this checklist to avoid common mistakes.

Before submitting a job, verify:

- All imports present (col, when, lit, etc. from pyspark.sql.functions).
- vector_to_array imported from the correct module: use from pyspark.ml.functions import vector_to_array (NOT pyspark.sql.functions).
- DataFrame schema matches the target Iceberg table: verify with df.printSchema() before writing.
- CSV files read with header and inferSchema: without these options, the header row becomes data and all columns are strings.
- Avoid toPandas(): converting a PySpark DataFrame to pandas with toPandas() can cause out-of-memory errors. It is only acceptable for building visualizations in Spark 3.5.
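
Several checklist items can be expressed as small helpers. This is a sketch under stated assumptions, not the skill's own code: read_csv and schema_mismatches are hypothetical names, spark is an existing SparkSession passed in by the caller, and the schema comparison works on the (column, type) pairs that DataFrame.dtypes returns. It does not compare nullability.

```python
# The import items from the checklist, as they would appear at the top of a job
# (kept as comments so this sketch runs without a Spark installation):
#   from pyspark.sql.functions import col, when, lit
#   from pyspark.ml.functions import vector_to_array

# Without these options the header row is read as data and every column is a string.
CSV_OPTIONS = {"header": "true", "inferSchema": "true"}


def read_csv(spark, path):
    """Read a CSV with the first row as column names and inferred column types."""
    return spark.read.options(**CSV_OPTIONS).csv(path)


def schema_mismatches(df_dtypes, target_dtypes):
    """Compare two lists of (column, type) pairs, e.g. taken from DataFrame.dtypes.

    Returns a list of human-readable problems; an empty list means the
    column names and types agree.
    """
    source, target = dict(df_dtypes), dict(target_dtypes)
    problems = []
    for name in sorted(source.keys() - target.keys()):
        problems.append(f"unexpected column: {name}")
    for name in sorted(target.keys() - source.keys()):
        problems.append(f"missing column: {name}")
    for name in sorted(source.keys() & target.keys()):
        if source[name] != target[name]:
            problems.append(
                f"type mismatch: {name} is {source[name]}, target expects {target[name]}"
            )
    return problems
```

In a job, the check could run just before the write, for example schema_mismatches(df.dtypes, target_df.dtypes), where target_df is the destination table loaded as a DataFrame. Failing fast on a non-empty result is cheaper than debugging a rejected write.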

IAM Requirements

The Dataproc service account needs:

roles/dataproc.worker: Job execution
roles/biglake.admin: Iceberg table management
roles/bigquery.jobUser: Query materialization
roles/storage.objectUser: Read/write GCS
roles/spanner.databaseUser: Spanner writes
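
As a rough illustration of granting the roles above, the sketch below builds one gcloud projects add-iam-policy-binding command per role. It assumes project-level grants are acceptable (narrower, resource-level grants may be preferable in practice), and the project ID and service account email shown are placeholders, not values from this skill.

```python
import shlex

# Roles listed in the IAM requirements above.
ROLES = [
    "roles/dataproc.worker",
    "roles/biglake.admin",
    "roles/bigquery.jobUser",
    "roles/storage.objectUser",
    "roles/spanner.databaseUser",
]


def binding_commands(project_id, service_account_email):
    """Return one gcloud argument list per role, granting it at project level."""
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{service_account_email}",
            f"--role={role}",
        ]
        for role in ROLES
    ]


if __name__ == "__main__":
    # Placeholder identifiers; substitute real values before running the output.
    for cmd in binding_commands(
        "my-project", "spark-sa@my-project.iam.gserviceaccount.com"
    ):
        print(shlex.join(cmd))
```

The script only prints the commands, so they can be reviewed before anything is applied to the project.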

Spark resource management

Refer to references/gcloud_dataproc.md for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.
