Skill

synthdata-generate

Generate synthetic tabular datasets from YAML schemas. Use this skill when the user wants sample, mock, test, demo, or synthetic data for any domain: HR directories, e-commerce orders, SaaS metrics, healthcare records, financial transactions, security events, application logs, IoT sensor readings, CRM pipelines, survey responses, or custom schemas. Ships with 10+ domain templates and supports custom YAML schemas with Faker-backed fields, statistical distributions (normal/lognormal/zipf/poisson), foreign-key integrity, behavioral profiles, and temporal event generation. Also trigger when the user says "generate synthetic data", "create fake data", "mock dataset", or "test data", or names a specific domain such as "e-commerce data" or "HR data".

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/synthdata:synthdata-generate

User invocable

Model invocable

Inline context

Default effort

Tool Access

This skill is limited to the following tools:

Read Bash Glob Write

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Generate synthetic tabular datasets using **bundled Python scripts** — no code generation required. A schema-driven engine reads YAML and produces xlsx, csv, json, sql, or parquet output.

Supporting Files

SKILL.md

120 lines · ~1.5k tokens

Stats

Language: Python

Stars: 0

Maintenance: Good

Last Commit: Apr 9, 2026

Synthdata Generate

Generate synthetic tabular datasets using bundled Python scripts — no code generation required. A schema-driven engine reads YAML and produces xlsx, csv, json, sql, or parquet output.

Prerequisites

pip install openpyxl faker numpy pandas pyyaml --break-system-packages

Workflow

Step 1: Interview the user

Ask three questions:

What domain? — Offer the template list (see below). If none fit, offer blank-slate + custom columns.
What scale? — quick (demo-size), medium (default, hundreds–thousands), thorough (full-fidelity, may be 10K+ rows).
What format? — xlsx (default), csv, json, sql, or parquet.

List available templates:

python scripts/generate.py --list-templates

Present defaults as: "I'll use <template> at medium effort → xlsx by default. Which of these would you like to change?"

Wait for the user's response before proceeding.

Effort Levels

| Level | Rows | Profiles | When to use |
| --- | --- | --- | --- |
| quick | Smallest row counts (template-defined) | Flat baseline (no behavioral variation) | Smoke tests, schema checks, fast iteration |
| medium | Default row counts | Behavioral profiles with jitter | Day-to-day use |
| thorough | Largest row counts | Full profile variation | Stakeholder-facing deliverables |

Effort controls row counts, profile richness, and (for some templates) time window length. The schema itself (columns, FK structure, distributions) is identical across all levels.

Step 2: Run the generator

# Use a built-in template
python scripts/generate.py --template hr-directory --effort medium --output ./hr.xlsx

# Use a custom schema file
python scripts/generate.py --schema ./my-schema.yaml --effort medium --output ./out/

# Override output format
python scripts/generate.py --template saas-metrics --format json --output ./saas.json

CLI flags:

| Flag | Default | Description |
| --- | --- | --- |
| `--template <name>` | (none) | Use a bundled template (see `--list-templates`) |
| `--schema <path>` | (none) | Use a custom YAML schema file |
| `--effort` | `medium` | `quick` / `medium` / `thorough` |
| `--output` | `./synthdata_output` | File path (xlsx/json/sql) or directory (csv/multi-table json) |
| `--format` | schema default | Override output format: xlsx, csv, json, sql, parquet |
| `--seed` | `42` | Random seed for reproducibility |
| `--locale` | `en_US` | Faker locale (e.g. `en_GB`, `de_DE`, `ja_JP`) |
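The flags in the table can also be assembled programmatically when the generator is driven from another script. The sketch below is illustrative only: `build_command` is a hypothetical helper, not part of the skill, and it assumes `--template` and `--schema` are mutually exclusive, which the table does not state. It builds the argument list and runs nothing.

```python
from typing import List, Optional

EFFORTS = ("quick", "medium", "thorough")
FORMATS = ("xlsx", "csv", "json", "sql", "parquet")


def build_command(
    output: str,
    template: Optional[str] = None,
    schema: Optional[str] = None,
    effort: str = "medium",
    fmt: Optional[str] = None,
    seed: int = 42,
    locale: str = "en_US",
) -> List[str]:
    """Assemble an argv list for scripts/generate.py from the documented flags."""
    # Assumption: exactly one of --template / --schema is expected.
    if (template is None) == (schema is None):
        raise ValueError("pass exactly one of template or schema")
    if effort not in EFFORTS:
        raise ValueError(f"effort must be one of {EFFORTS}")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")

    cmd = ["python", "scripts/generate.py"]
    if template is not None:
        cmd += ["--template", template]
    else:
        cmd += ["--schema", schema]
    cmd += ["--effort", effort, "--output", output]
    if fmt is not None:
        # Omit --format so the schema's default format applies.
        cmd += ["--format", fmt]
    cmd += ["--seed", str(seed), "--locale", locale]
    return cmd


print(" ".join(build_command("./saas.json", template="saas-metrics", fmt="json")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the skill's root directory. Passing the same `seed` on a later run is what should make the output reproducible.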

Step 3: Custom schema authoring (if no template fits)

Copy templates/blank-slate.yaml as a starting point, edit the copy, then pass it to the generator with --schema. See references/schema-spec.md for the full spec.

Key concepts:

Column types: id, faker, choice, int, float, bool, date, timestamp, constant, formula, ref
Distributions: uniform, normal, lognormal, exponential, poisson, gamma, pareto
Foreign keys: child tables declare foreign_key: {column, references: "table.col", distribution: uniform|zipfian}
Profiles: behavioral personas (e.g., 5% whales, 20% dormant) that can drive rows_per_parent via lam_expr
Temporal: timestamp column with pattern: uniform | business-hours | diurnal, weekday_only, start/end
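To make the foreign-key `distribution` option concrete, here is a minimal sketch of how uniform and Zipf-like parent selection differ. It is not the engine's implementation: `assign_foreign_keys` is a hypothetical function, and the 1/rank weighting is one common way to approximate a Zipfian skew, assumed here for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence


def assign_foreign_keys(
    parent_ids: Sequence[int],
    n_children: int,
    distribution: str = "uniform",
    seed: int = 42,
) -> List[int]:
    """Pick a parent id for each child row, so every FK value exists in the parent table."""
    rng = random.Random(seed)  # local RNG keeps runs reproducible per seed
    if distribution == "uniform":
        weights = [1.0] * len(parent_ids)
    elif distribution == "zipfian":
        # Assumed weighting: the parent at rank r (1-based) gets weight 1/r,
        # so a few parents collect most of the child rows.
        weights = [1.0 / rank for rank in range(1, len(parent_ids) + 1)]
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return rng.choices(parent_ids, weights=weights, k=n_children)


customer_ids = list(range(1, 11))
for dist in ("uniform", "zipfian"):
    order_fks = assign_foreign_keys(customer_ids, 1000, dist)
    # Show the three customers with the most orders under each distribution.
    print(dist, Counter(order_fks).most_common(3))
```

Because child values are drawn only from `parent_ids`, referential integrity holds by construction. The skewed variant is what makes patterns such as "a handful of customers place most orders" show up in the data.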

Step 4: Deliver

Report row counts per table, file path, and seed. If the user wants to iterate, re-run with a different --seed or --effort.
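When the output is a directory of CSV files, the per-table row counts for the report can be gathered with a few lines of standard-library Python. This is a sketch under assumptions: it supposes one `<table>.csv` file per table with a single header row, which the skill does not spell out, and `csv_row_counts` and `delivery_report` are hypothetical helpers.

```python
import csv
from pathlib import Path
from typing import Dict


def csv_row_counts(output_dir: str) -> Dict[str, int]:
    """Count data rows (excluding the header) in each CSV file of a directory."""
    counts: Dict[str, int] = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the assumed header row
            counts[path.stem] = sum(1 for _ in reader)
    return counts


def delivery_report(output_dir: str, seed: int) -> str:
    """Format the output path, seed, and per-table row counts as plain text."""
    lines = [f"Output: {output_dir}", f"Seed: {seed}"]
    for table, n_rows in csv_row_counts(output_dir).items():
        lines.append(f"  {table}: {n_rows} rows")
    return "\n".join(lines)
```

Calling `print(delivery_report("./out", seed=42))` after a CSV run would give a summary that can be pasted straight into the reply to the user.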

Available Templates

| Template | Tables | Use case |
| --- | --- | --- |
| `hr-directory` | departments, employees | Employee directories, HRIS test data |
| `ecommerce-orders` | customers, products, orders | Retail/marketplace analytics, RFM analysis |
| `saas-metrics` | accounts, users, events, subscriptions | Product analytics, MRR dashboards |
| `healthcare-patients` | patients, providers, encounters, claims | EHR sandboxes, payer analytics |
| `financial-transactions` | customers, accounts, transactions | Banking, fraud-detection training data |
| `security-events` | users, devices, alerts, incidents | SIEM demos, SOC training |
| `log-events` | services, requests, errors | Log-analytics dashboards, observability |
| `iot-sensors` | devices, readings, events | IoT platform testing, anomaly detection |
| `crm-pipeline` | contacts, companies, deals, activities | Sales enablement, pipeline dashboards |
| `survey-responses` | respondents, questions, responses | Research, market surveys |
| `healthcare-hrm-security` | users, threats, sims, training, DLP, abuse, monthly_risk | Human risk intelligence |
| `blank-slate` | users | Minimal starter for custom schemas |

Additional Resources

references/schema-spec.md — Complete YAML schema reference
references/distributions.md — Statistical distribution guide
references/faker-fields.md — Faker method cheat sheet
examples/ — Worked examples of custom schemas
