From dak
Develops and executes Spark ETL pipelines on GCP Dataproc clusters and serverless. Reads/writes BigLake Iceberg, BigQuery, Spanner. Debugs failures and manages jobs.
Slash command
/dak:gcp-spark
> [!IMPORTANT]
> You MUST ALWAYS follow the Task Execution Workflow when writing Spark code.
Task Execution Workflow:

1. Use the @skill:discovering-gcp-data-assets skill or `references/schema_direct_inspection.md` to understand input and output schemas. Include the schema in your thought process BEFORE generating any code. Do NOT guess column names.
2. Refer to `references/read_write_data.md` when reading or writing data.
3. Refer to the @skill:ml-best-practices skill and `references/ml_tasks.md` when generating ML code.
4. Refer to `references/spark_optimizations.md` when generating Spark code, and apply optimizations whenever applicable.
5. Use `df.printSchema()` for the DataFrame schema, and refer to the @skill:discovering-gcp-data-assets skill or `references/schema_direct_inspection.md` to verify the destination schema.
6. For notebooks, run `jupyter nbconvert --to script your-notebook.ipynb` first, then compile the code using `python3 -m py_compile your-notebook.py`.
7. When generating a `.py` script, refer to `references/gcloud_dataproc.md` for how to write the command that executes the generated code on Dataproc. This DOES NOT apply when generating notebooks.

> [!CAUTION]
> Ensure you verify this checklist to avoid mistakes
Before submitting a job, verify:

- All required functions are imported (`col`, `when`, `lit`, etc. from `pyspark.sql.functions`).
- `vector_to_array` comes from the correct module: use `from pyspark.ml.functions import vector_to_array` (NOT `pyspark.sql.functions`).
- The schema has been checked with `df.printSchema()` before writing.
- CSV reads set `header` and `inferSchema`; without these, the header row becomes data and all columns are strings.

The Dataproc service account needs:

- `roles/dataproc.worker`: Job execution
- `roles/biglake.admin`: Iceberg table management
- `roles/bigquery.jobUser`: Query materialization
- `roles/storage.objectUser`: Read/write GCS
- `roles/spanner.databaseUser`: Spanner writes

Refer to `references/gcloud_dataproc.md` for detailed guidelines on managing Spark clusters, jobs, batches, and interactive sessions.
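
The syntax and import items in the pre-submission checklist above can be partly checked before a job is submitted. The following is a minimal sketch, not part of the skill itself: `check_script` and `SQL_FUNCTIONS` are hypothetical names, it uses only the Python standard library, and its static analysis is a heuristic that only sees `from ... import ...` statements and bare names.

```python
import ast
import py_compile
from pathlib import Path

# Function names the checklist mentions; extend for your own jobs (assumption).
SQL_FUNCTIONS = {"col", "when", "lit"}


def check_script(path):
    """Return a list of problems found in a generated PySpark script.

    Hypothetical helper: a static heuristic, not a replacement for running the job.
    """
    # Syntax check, the same compilation that `python3 -m py_compile <path>` performs.
    try:
        py_compile.compile(str(path), doraise=True)
    except py_compile.PyCompileError as exc:
        return [f"syntax error: {str(exc).strip()}"]

    tree = ast.parse(Path(path).read_text())

    imported = {}         # local name -> module it was imported from
    star_modules = set()  # modules pulled in with `from module import *`
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name == "*":
                    star_modules.add(node.module)
                else:
                    imported[alias.asname or alias.name] = node.module

    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    loaded = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}

    problems = []
    # Flag col/when/lit used as bare names without a matching import.
    if "pyspark.sql.functions" not in star_modules:
        for name in sorted((SQL_FUNCTIONS & loaded) - assigned):
            if name not in imported:
                problems.append(f"{name} is used but never imported")

    # vector_to_array lives in pyspark.ml.functions, not pyspark.sql.functions.
    if imported.get("vector_to_array") == "pyspark.sql.functions":
        problems.append("import vector_to_array from pyspark.ml.functions")
    return problems
```

Calling `check_script("your-notebook.py")` on the output of the `jupyter nbconvert` step returns an empty list when nothing is flagged. Names reached through a module alias (for example `F.col` after `import pyspark.sql.functions as F`) are not bare names, so the sketch leaves them alone.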
npx claudepluginhub gemini-cli-extensions/data-agent-kit-starter-pack --plugin dak