From dataproduct-builder-databricks
Extract a small sample of rows from a Databricks output port via a non-production SQL warehouse, scrub anything classified as PII or sensitive in the data contract, and upload the scrubbed sample to Entropy Data via the entropy-data CLI. Trigger when the user asks to "upload example data", "publish sample rows for the data product", or "give consumers a preview of the data".
Slash command
/dataproduct-builder-databricks:dataproduct-exampledata
Sample rows let prospective consumers evaluate a data product without requesting access. This skill pulls a small sample from a non-production source via a Databricks SQL warehouse, scrubs sensitive columns, and uploads the result.
Before running Step 0, print this plan to the user verbatim:
Running dataproduct-exampledata. I'll:
- Pre-checks: Databricks bundle, ODCS files, `entropy-data` CLI, `databricks` CLI authenticated, a non-prod SQL warehouse and catalog.
- Identify the output port and its contract.
- Build a scrub plan: drop PII/sensitive columns, hash IDs, drop free text. Wait for your confirmation.
- Extract ~20 sample rows via `databricks sql query` against the non-prod target.
- Build the example-data YAML and show the first rows. Wait for your confirmation.
- Upload via `entropy-data example-data put`.
- Summarize what was uploaded, what was scrubbed, and cleanup options.
Then proceed.
- `databricks.yml` exists at the working directory root (this is a Declarative Automation Bundle).
- ODCS contract files exist under `src/output_ports/`.
- `uv run --quiet entropy-data --version` succeeds from the project root. If it fails, run `uv sync` and retry; if still missing, stop and tell the user to verify `entropy-data` is listed in `pyproject.toml`'s `[dependency-groups].dev`. Use `uv run entropy-data …` for every CLI invocation in this skill.
- `databricks --version` is on PATH and `databricks auth describe` succeeds. If not, stop and tell the user to run `databricks auth login --host <host>`.
- Read the `targets:` block of `databricks.yml` and pick a target whose `mode:` is `development` or whose catalog is clearly non-prod (e.g. `<user>_dev`). Never use a prod target in this skill. If only a prod target exists, stop and tell the user to add a dev target first.
- Resolve the SQL warehouse in this order: (a) the `--warehouse-id` the user supplies, (b) the `warehouse_id` declared on the chosen target in `databricks.yml`, (c) the first warehouse from `databricks warehouses list -o json` that is running. If none is available, stop and ask the user to start one.

If multiple output ports exist, ask which one. For each candidate, you need:
- `OUTPUT_PORT_ID` (from `<id>.odps.yaml`)
- The contract file at `src/output_ports/v<N>/<file>.odcs.yaml`
- The fully qualified table name `catalog.schema.table` from the server block

Read the contract's field list. For each field, decide what to do with it:
| Contract signal | Action |
|---|---|
| `classification: pii` (or `confidential`, `restricted`) | Drop the column from the sample |
| Field name matches obvious PII patterns (`email`, `phone`, `ssn`, `passport`, `dob`, `birth_date`, `iban`, `address`, `name`, `first_name`, `last_name`) and no classification | Treat as PII, drop unless the user explicitly opts in |
| `tags` containing `pii` / `sensitive` / `gdpr` | Drop |
| Numeric ID that could be a customer/user identifier | Hash with a one-way function and prefix `sample_` (use Spark SQL `sha2(cast(<col> as string), 256)`) |
| Free-text comment/note/description columns | Drop unless the user explicitly opts in (free text often leaks PII not declared in the contract) |
| Everything else | Keep |
Show the user the scrub plan as a table — column → action — and wait for confirmation before extracting any data.
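The decision table above can be sketched as a small classifier. This is a minimal illustration, not the skill's actual implementation: the field shape (`name`, `type`, `classification`, `tags`), the substring matching, and the "numeric column ending in `_id`" heuristic for identifiers are all assumptions made for the example.

```python
# Hypothetical field shape: {"name": str, "type": str, "classification": str | None, "tags": list[str]}
SENSITIVE_CLASSIFICATIONS = {"pii", "confidential", "restricted"}
SENSITIVE_TAGS = {"pii", "sensitive", "gdpr"}
PII_NAME_PATTERNS = ("email", "phone", "ssn", "passport", "dob", "birth_date",
                     "iban", "address", "name", "first_name", "last_name")
FREE_TEXT_PATTERNS = ("comment", "note", "description")
NUMERIC_TYPES = {"int", "integer", "long", "bigint", "number"}  # assumed type names


def scrub_action(field: dict) -> str:
    """Return 'drop', 'hash', or 'keep' for one contract field."""
    name = field["name"].lower()
    classification = (field.get("classification") or "").lower()
    tags = {t.lower() for t in field.get("tags") or []}

    if classification in SENSITIVE_CLASSIFICATIONS:
        return "drop"
    if tags & SENSITIVE_TAGS:
        return "drop"
    # Name-based PII detection only applies when the contract gives no classification.
    if not classification and any(p in name for p in PII_NAME_PATTERNS):
        return "drop"
    if any(p in name for p in FREE_TEXT_PATTERNS):
        return "drop"
    # Assumption: a numeric column ending in "_id" may identify a customer or user.
    if name.endswith("_id") and (field.get("type") or "").lower() in NUMERIC_TYPES:
        return "hash"
    return "keep"


def scrub_plan(fields: list) -> dict:
    """Map each column name to its action, ready to show as a column -> action table."""
    return {f["name"]: scrub_action(f) for f in fields}
```

Substring matching is deliberately broad (for example `ip_address` is dropped too); the user confirmation step is where an over-eager drop gets reversed.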
Build the SQL against the non-prod catalog from Step 0. Use Unity Catalog three-part naming:
select <kept-and-hashed-columns>
from <catalog>.<schema>.<table>
limit <N>;
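As a rough sketch, the statement above could be generated from a confirmed scrub plan. The plan shape (column name mapped to `keep`, `hash`, or `drop`) and the function name are hypothetical; the `concat('sample_', ...)` wrapper is one reading of "hash and prefix `sample_`", not something the contract or CLI mandates.

```python
def build_sample_sql(catalog: str, schema: str, table: str,
                     plan: dict, limit: int = 20) -> str:
    """Build the sample SELECT from a column -> action plan ('keep', 'hash', 'drop')."""

    def quote(identifier: str) -> str:
        # Backtick-quote identifiers, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    select_items = []
    for column, action in plan.items():
        if action == "keep":
            select_items.append(quote(column))
        elif action == "hash":
            select_items.append(
                f"concat('sample_', sha2(cast({quote(column)} as string), 256)) as {quote(column)}"
            )
        # Columns marked 'drop' are omitted entirely.
    if not select_items:
        raise ValueError("scrub plan keeps no columns; nothing to sample")

    table_ref = ".".join(quote(part) for part in (catalog, schema, table))
    return f"select {', '.join(select_items)}\nfrom {table_ref}\nlimit {int(limit)}"
```

Dropped columns never appear in the statement, so sensitive values are not read from the warehouse at all rather than being filtered afterwards.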
Default N = 20. Preferred extraction methods, in order:

1. `databricks api post /api/2.0/sql/statements`: primary path. Works on every supported Databricks CLI version. Body shape: `{ "warehouse_id": "<id>", "statement": "<sql>", "wait_timeout": "30s" }`
   With `wait_timeout: "30s"` (the API accepts `0s` or a value from `5s` to `50s`), most statements return inline. If `.status.state` is `SUCCEEDED` after the POST, read rows from `.result.data_array`. If it returns `PENDING` or `RUNNING`, poll `databricks api get /api/2.0/sql/statements/<statement-id>` every 2s until terminal (`SUCCEEDED` / `FAILED` / `CANCELED`); cap at 5 minutes.
   Column metadata for naming is at `.manifest.schema.columns[].name` (matches the SELECT order).
2. `databricks sql query`: optional alternative if your CLI version (≥ 2.x roadmap) ships this subcommand. Returns the same JSON shape so the post-processing is identical. Skip and use method 1 if it errors with "unknown command".
3. Manual fallback: if both fail, give the user the SQL, ask them to run it in a workspace SQL editor, and have them paste the results back.
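The polling rule for the primary path (every 2s, capped at 5 minutes) might look like the sketch below. The function names are hypothetical, and `fetch` and `sleep` are injectable only so the loop can be exercised without a workspace; the default `fetch_statement` shells out to the `databricks api get` command named above.

```python
import json
import subprocess
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def fetch_statement(statement_id: str) -> dict:
    """Fetch statement status through the Databricks CLI's generic `api get` command."""
    completed = subprocess.run(
        ["databricks", "api", "get", f"/api/2.0/sql/statements/{statement_id}"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def wait_for_statement(response: dict, fetch=fetch_statement,
                       interval: float = 2.0, timeout: float = 300.0,
                       sleep=time.sleep) -> dict:
    """Poll until the statement reaches a terminal state or the time cap is hit."""
    waited = 0.0
    while response["status"]["state"] not in TERMINAL_STATES:
        if waited >= timeout:
            raise TimeoutError(
                f"statement {response['statement_id']} still running after {timeout}s"
            )
        sleep(interval)
        waited += interval
        response = fetch(response["statement_id"])
    if response["status"]["state"] != "SUCCEEDED":
        raise RuntimeError(f"statement ended in state {response['status']['state']}")
    return response
```

Pass the parsed JSON from the initial POST as `response`; if it is already `SUCCEEDED`, the function returns it without polling.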
Convert the result rows into a list of objects keyed by the contract column names (the names that will be visible to consumers, not the warehouse aliases). Hold the rows in memory as ROWS for the next step — do not write a CSV. The entropy-data example-data put command takes a JSON or YAML body, not a CSV.
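A minimal sketch of that conversion, assuming the response shape described above (`.manifest.schema.columns[].name` and `.result.data_array`). The `rename` mapping from warehouse aliases to contract column names is a hypothetical helper argument, not part of any API.

```python
from typing import Optional


def rows_from_response(response: dict, rename: Optional[dict] = None) -> list:
    """Zip column names from the manifest with each row of result.data_array."""
    rename = rename or {}
    columns = [
        rename.get(col["name"], col["name"])  # prefer the contract name when one is given
        for col in response["manifest"]["schema"]["columns"]
    ]
    data = response.get("result", {}).get("data_array") or []
    # Values are kept exactly as the API returned them; no type coercion is attempted.
    return [dict(zip(columns, row)) for row in data]
```

The returned list is what the text calls `ROWS`; it stays in memory and feeds the document built in the next step.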
Construct the document the CLI expects (shown as YAML; JSON with the same keys is equally valid and is the default this skill writes — see below):
id: <DATA_PRODUCT_ID>-<OUTPUT_PORT_ID>
dataProductId: <DATA_PRODUCT_ID>
outputPortId: <OUTPUT_PORT_ID>
dataContractId: <CONTRACT_ID>
schemaName: <model name from the ODCS schema/models block>
data:
- { <col>: <val>, ... } # one entry per row from ROWS
- ...
Field semantics confirmed against entropy-data example-data list -o json: the ID convention is <dataProductId>-<outputPortId>; schemaName is the contract's top-level schema/models key (the table name as the contract names it).
Write the document to examples/<DATA_PRODUCT_ID>-<OUTPUT_PORT_ID>.json (create examples/ if missing; add examples/ to .gitignore if absent). Default to .json so the script needs only Python's stdlib (import json); the init template's pyproject.toml does not pin pyyaml as a dev dep. If the user explicitly asks for YAML output, write .yaml instead — but then ensure pyyaml is available first via uv add --group dev pyyaml.
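Using only the standard library, building and writing the document could look roughly like this. The helper names and the `project_root` parameter are illustrative; the keys and the file location follow the description above.

```python
import json
from pathlib import Path


def build_document(data_product_id: str, output_port_id: str,
                   contract_id: str, schema_name: str, rows: list) -> dict:
    """Assemble the example-data document with the keys listed above."""
    return {
        "id": f"{data_product_id}-{output_port_id}",
        "dataProductId": data_product_id,
        "outputPortId": output_port_id,
        "dataContractId": contract_id,
        "schemaName": schema_name,
        "data": rows,
    }


def write_document(document: dict, project_root: Path = Path(".")) -> Path:
    """Write examples/<id>.json and make sure examples/ is git-ignored."""
    examples = project_root / "examples"
    examples.mkdir(exist_ok=True)

    gitignore = project_root / ".gitignore"
    text = gitignore.read_text() if gitignore.exists() else ""
    if "examples/" not in text.splitlines():
        # Add a newline first if the existing file does not end with one.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(text + prefix + "examples/\n")

    path = examples / f"{document['id']}.json"
    path.write_text(json.dumps(document, indent=2) + "\n")
    return path
```

Calling `write_document` twice is harmless: the file is overwritten and the `.gitignore` entry is added only once.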
Print the first 5 rows of data: in a Markdown table. Re-state the dropped columns. Wait for explicit user confirmation before uploading.
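One way to render that preview is sketched below. The function name is hypothetical, and it assumes every row has the same keys as the first one.

```python
def preview_table(rows: list, limit: int = 5) -> str:
    """Render the first `limit` rows as a Markdown table."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0])

    def cell(value) -> str:
        text = "" if value is None else str(value)
        # Escape pipes and flatten newlines so a value cannot break the table layout.
        return text.replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "---|" * len(columns),
    ]
    for row in rows[:limit]:
        lines.append("| " + " | ".join(cell(row.get(c)) for c in columns) + " |")
    return "\n".join(lines)
```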
entropy-data example-data put <DATA_PRODUCT_ID>-<OUTPUT_PORT_ID> \
--file examples/<DATA_PRODUCT_ID>-<OUTPUT_PORT_ID>.json
Notes on the CLI shape (verified against entropy-data example-data put --help):
- The id is a positional argument; there are no `--data-product` / `--output-port` flags. By convention it is `<dataProductId>-<outputPortId>`; this must also match the `id:` field inside the document.
- `--file` accepts JSON or YAML, or `-` for stdin.
- `put` is upsert: running it again replaces the previous sample for that id.

If the CLI errors, surface the actual error and the relevant `--help` output to the user; do not improvise a different command.
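A hedged sketch of the upload step follows. It checks that the document's `id` matches the `<dataProductId>-<outputPortId>` convention before building the command, and surfaces the CLI's own error text on failure. The helper names are hypothetical; the command line uses only the `put <id> --file <path>` shape shown above, run through `uv run` as the pre-checks require.

```python
import json
import subprocess
from pathlib import Path


def put_command(doc_path: Path) -> list:
    """Build the argv for the upload after checking the id convention."""
    document = json.loads(doc_path.read_text())
    expected_id = f"{document['dataProductId']}-{document['outputPortId']}"
    if document["id"] != expected_id:
        raise ValueError(f"document id {document['id']!r} does not match {expected_id!r}")
    return ["uv", "run", "entropy-data", "example-data", "put",
            expected_id, "--file", str(doc_path)]


def upload(doc_path: Path) -> None:
    """Run the upload and raise with the CLI's own output if it fails."""
    result = subprocess.run(put_command(doc_path), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the real error instead of retrying with a guessed command.
        raise RuntimeError(result.stderr.strip() or result.stdout.strip())
```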
End with this two-part recap. Use the shared Status enum (AGENTS.md § Final-report Status enum).
Part 1 — outcome table.
| Artifact | Status | Details |
|---|---|---|
| Output port | already present | <DATA_PRODUCT_ID>/<OUTPUT_PORT_ID> |
| Scrub plan | … | <dropped-count> dropped, <hashed-count> hashed, <kept-count> kept |
| Sample extraction | … | <rows> rows via databricks sql query (warehouse <warehouse-id>, catalog <catalog>) |
| Example-data file | … | examples/<DATA_PRODUCT_ID>-<OUTPUT_PORT_ID>.json (or .yaml if explicitly chosen) |
| Upload to Entropy Data | … | entropy-data example-data put succeeded (upsert) |
Part 2 — next steps. Bullet list:
- Delete `examples/<DATA_PRODUCT_ID>-<OUTPUT_PORT_ID>.{json,yaml}` if it contains anything the user doesn't want left on disk.

If there is nothing additional to surface, write a single line: No further action required.
- `examples/` belongs in `.gitignore`. The uploaded copy is the system of record.
- `entropy-data example-data put` is upsert and will overwrite the previous sample for the same id (`<dataProductId>-<outputPortId>`). Mention this in the final report so the user knows the prior sample is gone.
npx claudepluginhub entropy-data/dataproduct-builder-databricks --plugin dataproduct-builder-databricks