From synthdata
Generate synthetic tabular datasets from YAML schemas. Use this skill when the user wants to create sample data, mock data, test data, synthetic datasets, or demo data for any domain — HR directories, e-commerce orders, SaaS metrics, healthcare records, financial transactions, security events, application logs, IoT sensor readings, CRM pipelines, survey responses, or custom schemas. Ships with 10+ domain templates and supports custom YAML schemas with Faker-backed fields, statistical distributions (normal/lognormal/zipf/poisson), foreign-key integrity, behavioral profiles, and temporal event generation. Also trigger when user says "generate synthetic data", "create fake data", "mock dataset", "test data", or names a specific domain like "e-commerce data" or "HR data".
Slash command: `/synthdata:synthdata-generate`
Generate synthetic tabular datasets using **bundled Python scripts** — no code generation required. A schema-driven engine reads YAML and produces xlsx, csv, json, sql, or parquet output.
- `references/distributions.md`
- `references/faker-fields.md`
- `references/schema-spec.md`
- `scripts/engine/__init__.py`
- `scripts/engine/distributions.py`
- `scripts/engine/faker_fields.py`
- `scripts/engine/profiles.py`
- `scripts/engine/relationships.py`
- `scripts/engine/schema.py`
- `scripts/engine/temporal.py`
- `scripts/engine/writers/__init__.py`
- `scripts/engine/writers/csv.py`
- `scripts/engine/writers/json.py`
- `scripts/engine/writers/parquet.py`
- `scripts/engine/writers/sql.py`
- `scripts/engine/writers/xlsx.py`
- `scripts/generate.py`
- `templates/blank-slate.yaml`
- `templates/crm-pipeline.yaml`
- `templates/ecommerce-orders.yaml`
pip install openpyxl faker numpy pandas pyyaml --break-system-packages
Ask three questions:
- blank-slate + custom columns.
- quick (demo-size), medium (default, hundreds–thousands), thorough (full-fidelity, may be 10K+ rows).

List available templates:
python scripts/generate.py --list-templates
Present defaults as: "I'll use <template> at medium effort → xlsx by default. Which of these would you like to change?"
Wait for the user's response before proceeding.
| Level | Rows | Profiles | When to use |
|---|---|---|---|
| quick | Smallest row counts (template-defined) | Flat baseline (no behavioral variation) | Smoke tests, schema checks, fast iteration |
| medium | Default row counts | Behavioral profiles with jitter | Day-to-day use |
| thorough | Largest row counts | Full profile variation | Stakeholder-facing deliverables |
Effort controls row counts, profile richness, and (for some templates) time window length. The schema itself (columns, FK structure, distributions) is identical across all levels.
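
The relationship between effort and schema can be pictured with a small sketch. This is not the engine's code: `TEMPLATE`, `plan_table`, the column names, and the row counts are all hypothetical, chosen only to show that the effort level selects a row count while the column list stays the same.

```python
# Hypothetical template fragment: the per-effort row counts are made-up numbers.
TEMPLATE = {
    "columns": ["employee_id", "name", "department_id"],
    "rows": {"quick": 50, "medium": 1_000, "thorough": 10_000},
}


def plan_table(template, effort="medium"):
    """Return (columns, row_count) for an effort level."""
    if effort not in template["rows"]:
        raise ValueError(f"unknown effort level: {effort}")
    # Columns come straight from the template; only the row count depends on effort.
    return list(template["columns"]), template["rows"][effort]


for level in ("quick", "medium", "thorough"):
    columns, n_rows = plan_table(TEMPLATE, level)
    print(level, n_rows, columns)
```

In the real templates the row counts are template-defined, so the numbers above should not be read as the skill's actual sizes.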
# Use a built-in template
python scripts/generate.py --template hr-directory --effort medium --output ./hr.xlsx
# Use a custom schema file
python scripts/generate.py --schema ./my-schema.yaml --effort medium --output ./out/
# Override output format
python scripts/generate.py --template saas-metrics --format json --output ./saas.json
CLI flags:
| Flag | Default | Description |
|---|---|---|
| `--template <name>` | none | Use a bundled template (see `--list-templates`) |
| `--schema <path>` | none | Use a custom YAML schema file |
| `--effort` | medium | quick / medium / thorough |
| `--output` | ./synthdata_output | File path (xlsx/json/sql) or directory (csv/multi-table json) |
| `--format` | schema default | Override output format: xlsx, csv, json, sql, parquet |
| `--seed` | 42 | Random seed for reproducibility |
| `--locale` | en_US | Faker locale (e.g. en_GB, de_DE, ja_JP) |
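
For scripted or repeated runs, the documented flags can be assembled programmatically. The following is a minimal sketch and not part of the skill: `build_command` is a hypothetical helper, it assumes `scripts/generate.py` is reachable from the working directory, and it assumes a run takes either `--template` or `--schema` but not both. It only builds the argument list; the final comment shows how the list might be executed.

```python
import sys


def build_command(template=None, schema=None, effort="medium",
                  output="./synthdata_output", fmt=None, seed=42,
                  locale="en_US"):
    """Hypothetical helper: assemble a generate.py argument list from the documented flags."""
    # Assumption: a run is driven by either a bundled template or a custom schema.
    if (template is None) == (schema is None):
        raise ValueError("pass exactly one of template or schema")
    if effort not in ("quick", "medium", "thorough"):
        raise ValueError(f"unknown effort level: {effort}")

    cmd = [sys.executable, "scripts/generate.py"]
    cmd += ["--template", template] if template else ["--schema", schema]
    cmd += ["--effort", effort, "--output", output,
            "--seed", str(seed), "--locale", locale]
    if fmt is not None:  # omit --format to keep the schema's default format
        cmd += ["--format", fmt]
    return cmd


cmd = build_command(template="hr-directory", output="./hr.xlsx", seed=7)
print(" ".join(cmd[1:]))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the seed explicitly on every call keeps a record of which seed produced which file, which helps when a dataset needs to be regenerated later.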
Copy templates/blank-slate.yaml as a starting point, edit, then run with --schema. See references/schema-spec.md for the full spec.
Key concepts:
- id, faker, choice, int, float, bool, date, timestamp, constant, formula, ref
- uniform, normal, lognormal, exponential, poisson, gamma, pareto
- foreign_key: {column, references: "table.col", distribution: uniform|zipfian}
- rows_per_parent via lam_expr
- timestamp column with pattern: uniform | business-hours | diurnal, weekday_only, start/end

Report row counts per table, file path, and seed. If the user wants to iterate, re-run with a different --seed or --effort.
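
To make the `uniform|zipfian` choice for foreign keys concrete, here is a minimal standard-library sketch. It is not the code in `scripts/engine/relationships.py`: `assign_foreign_keys` is a hypothetical function, and the Zipf exponent of 1 is an assumption. Each child row draws its parent from the existing parent ids, so referential integrity holds by construction, and the zipfian option concentrates children on a few high-ranked parents.

```python
import random
from collections import Counter


def assign_foreign_keys(parent_ids, n_children, distribution="uniform", seed=42):
    """Pick a parent id for each child row; every result is a valid parent id."""
    rng = random.Random(seed)
    if distribution == "uniform":
        weights = [1.0] * len(parent_ids)
    elif distribution == "zipfian":
        # Weight the parent at rank r by 1/r (Zipf exponent 1, an assumption).
        weights = [1.0 / rank for rank in range(1, len(parent_ids) + 1)]
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    return rng.choices(parent_ids, weights=weights, k=n_children)


customers = list(range(1, 101))
for dist in ("uniform", "zipfian"):
    fks = assign_foreign_keys(customers, 10_000, dist)
    top_share = sum(c for _, c in Counter(fks).most_common(10)) / len(fks)
    print(f"{dist}: top 10 customers hold {top_share:.0%} of orders")
```

A skewed assignment like this tends to look more like real order or event data, where a small group of customers or accounts generates most of the activity.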
| Template | Tables | Use case |
|---|---|---|
| `hr-directory` | departments, employees | Employee directories, HRIS test data |
| `ecommerce-orders` | customers, products, orders | Retail/marketplace analytics, RFM analysis |
| `saas-metrics` | accounts, users, events, subscriptions | Product analytics, MRR dashboards |
| `healthcare-patients` | patients, providers, encounters, claims | EHR sandboxes, payer analytics |
| `financial-transactions` | customers, accounts, transactions | Banking, fraud-detection training data |
| `security-events` | users, devices, alerts, incidents | SIEM demos, SOC training |
| `log-events` | services, requests, errors | Log-analytics dashboards, observability |
| `iot-sensors` | devices, readings, events | IoT platform testing, anomaly detection |
| `crm-pipeline` | contacts, companies, deals, activities | Sales enablement, pipeline dashboards |
| `survey-responses` | respondents, questions, responses | Research, market surveys |
| `healthcare-hrm-security` | users, threats, sims, training, DLP, abuse, monthly_risk | Human risk intelligence |
| `blank-slate` | users | Minimal starter for custom schemas |
- `references/schema-spec.md`: Complete YAML schema reference
- `references/distributions.md`: Statistical distribution guide
- `references/faker-fields.md`: Faker method cheat sheet
- `examples/`: Worked examples of custom schemas
npx claudepluginhub rappdw/synthdata