synthdata-extend | synthdata

Synthdata Extend

Grow an existing dataset (xlsx, csv, or json) without regenerating it: the original rows stay intact, and the new rows appended after them follow the same patterns.

Prerequisites

pip install openpyxl faker numpy pandas pyyaml --break-system-packages

Workflow

Step 1: Identify the dataset and change

Ask the user two questions:

1. What dataset? Path to the existing file(s). Supported layouts:
   - xlsx workbook (each sheet = one table)
   - directory of csv files (one per table)
   - json bundle ({table_name: [records]})
2. What change? Either:
   - Add rows: "Add N more rows to <table>"
   - Add column: "Add a new column <name> of type <type> to <table>"
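
The three supported layouts can all be normalized to one in-memory shape before any extension happens. The following is a minimal sketch of that idea, not the loader used by scripts/extend.py: the function name load_tables is hypothetical, csv values are left as strings, and the xlsx branch assumes pandas and openpyxl from the Prerequisites are installed.

```python
import csv
import json
from pathlib import Path


def load_tables(path):
    """Return {table_name: [row_dict, ...]} for any of the three layouts."""
    p = Path(path)
    if p.is_dir():
        # Directory of csv files: one table per file, named after the file stem.
        tables = {}
        for csv_file in sorted(p.glob("*.csv")):
            with csv_file.open(newline="", encoding="utf-8") as fh:
                tables[csv_file.stem] = list(csv.DictReader(fh))
        return tables
    if p.suffix.lower() == ".json":
        # JSON bundle: already shaped as {table_name: [records]}.
        with p.open(encoding="utf-8") as fh:
            return json.load(fh)
    if p.suffix.lower() == ".xlsx":
        # Workbook: sheet_name=None makes pandas return one DataFrame per sheet.
        import pandas as pd  # imported lazily so csv/json work without pandas

        sheets = pd.read_excel(p, sheet_name=None)
        return {name: df.to_dict("records") for name, df in sheets.items()}
    raise ValueError(f"Unsupported dataset format: {path}")
```

With every layout reduced to a dict of row lists, "Add rows" and "Add column" can be handled by the same code regardless of the file format.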

Step 2: Locate the schema

Look for a companion schema file in these locations:

- <dataset>.schema.yaml (next to the dataset)
- ./schema.yaml
- the templates/ directory in the synthdata-generate skill

If the original schema isn't available, infer a minimal schema from the existing columns (header names + dtype detection) and proceed.
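
As a rough illustration of the lookup and the fallback inference, here is a sketch under stated assumptions. The helper names (find_schema, infer_type, infer_minimal_schema) are hypothetical, the templates/ location is left out because its path is not specified here, and the "str" fallback type is an assumption rather than one of the documented column types.

```python
from pathlib import Path


def find_schema(dataset_path):
    """Return the first companion schema file that exists, or None."""
    dataset = Path(dataset_path)
    candidates = [
        dataset.with_name(dataset.stem + ".schema.yaml"),  # <dataset>.schema.yaml
        Path("schema.yaml"),                               # ./schema.yaml
    ]
    return next((c for c in candidates if c.is_file()), None)


def infer_type(values):
    """Guess a column type from its non-empty values."""
    present = [str(v) for v in values if v not in (None, "")]
    if not present:
        return "str"
    if all(v.lower() in ("true", "false") for v in present):
        return "bool"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "str"  # assumed fallback for anything non-numeric


def infer_minimal_schema(rows):
    """Map each header name to an inferred type: {column: type_name}."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}
```

An inferred schema like this carries only names and types, so value distributions and foreign-key relationships still need a real schema file.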

Step 3: Run the extender

# Add 500 rows to the 'orders' table
python scripts/extend.py --input data.xlsx --table orders --add-rows 500 --output data_extended.xlsx

# Add a new column with a lognormal distribution
python scripts/extend.py --input data.xlsx --table employees --add-column salary_2026 \
    --col-type float --distribution lognormal --mean 100000 --sigma 0.4

# Use a provided schema file for new-row generation
python scripts/extend.py --input data.xlsx --schema data.schema.yaml --table orders --add-rows 500

CLI flags:

| Flag | Description |
| --- | --- |
| `--input` | Existing dataset (xlsx/csv/json) |
| `--output` | Output path (defaults to `<input>_extended.<ext>`) |
| `--table` | Target table name |
| `--add-rows N` | Append N new rows |
| `--add-column NAME` | Add a new column |
| `--col-type` | New column type: id, faker, choice, int, float, bool, date, timestamp, constant |
| `--schema` | Optional YAML schema file (for richer row synthesis) |
| `--seed` | Random seed (default: derived from the current timestamp, so each run differs) |
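
To make the lognormal example and the effect of a seed concrete, here is a small standard-library sketch. It is not the code in scripts/extend.py. In particular, it assumes the mean value is the target arithmetic mean of the generated column and sigma is the standard deviation of the underlying normal; how extend.py interprets --mean and --sigma is not stated here. The function name lognormal_column is hypothetical.

```python
import math
import random


def lognormal_column(n_rows, mean, sigma, seed=None):
    """Draw n_rows lognormal values whose arithmetic mean is roughly `mean`."""
    # A fixed seed gives a reproducible column; seed=None lets Python pick one.
    rng = random.Random(seed)
    # For a lognormal, E[X] = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean) - sigma ** 2 / 2
    return [round(rng.lognormvariate(mu, sigma), 2) for _ in range(n_rows)]


# Two runs with the same seed produce identical values.
first = lognormal_column(5, mean=100000, sigma=0.4, seed=42)
second = lognormal_column(5, mean=100000, sigma=0.4, seed=42)
assert first == second
```

Passing an explicit seed is what makes an extension repeatable; with the timestamp-based default, two runs on the same input generally produce different new values.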

Step 4: Report

Confirm the row counts before and after, and verify foreign-key (FK) integrity: every FK value in the new rows must resolve to an existing parent ID.
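
A report of this kind can be sketched as follows. The function names, the row-dict representation, and the "id" parent column default are assumptions made for illustration; the actual output of extend.py may look different.

```python
def find_orphans(child_rows, fk_column, parent_rows, parent_id_column="id"):
    """Return the child rows whose FK value has no matching parent ID."""
    parent_ids = {row[parent_id_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_ids]


def summarize(table_name, rows_before, rows_after, orphans):
    """Build a one-line report of row counts and FK integrity."""
    added = rows_after - rows_before
    status = "OK" if not orphans else f"{len(orphans)} orphaned row(s)"
    return (
        f"{table_name}: {rows_before} -> {rows_after} rows "
        f"(+{added}), FK integrity: {status}"
    )


# Hypothetical data: one new order points at a customer that does not exist.
customers = [{"id": "C001"}, {"id": "C002"}]
new_orders = [
    {"id": "O101", "customer_id": "C001"},
    {"id": "O102", "customer_id": "C999"},
]
orphans = find_orphans(new_orders, "customer_id", customers)
print(summarize("orders", 100, 102, orphans))
```

Only the new rows need checking, because the original rows are left unchanged by the extension.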

Safety

- Preserves existing rows unchanged: new rows are appended after the last original row
- Continues ID numbering: if the table's id column runs E00001-E00500, new rows start at E00501
- Respects FK constraints: new child rows only reference existing parent IDs
- Guards column names: refuses to add a column that already exists (unless --overwrite is passed)
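
Two of these guarantees, ID continuation and the column-name guard, are easy to illustrate. The sketch below assumes IDs of the form prefix plus zero-padded digits and tables held as lists of row dicts; next_ids and add_column are hypothetical helper names, not functions from extend.py.

```python
import re


def next_ids(existing_ids, count):
    """Continue a prefixed, zero-padded sequence such as E00001..E00500."""
    pattern = re.compile(r"^(\D*)(\d+)$")
    parsed = [pattern.match(str(i)) for i in existing_ids]
    if not parsed or not all(parsed):
        raise ValueError("IDs must look like <prefix><digits>")
    # Continue from the highest existing number, keeping prefix and padding.
    last = max(parsed, key=lambda m: int(m.group(2)))
    prefix, width = last.group(1), len(last.group(2))
    start = int(last.group(2)) + 1
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


def add_column(rows, name, values, overwrite=False):
    """Return new rows with the column added; refuse to clobber by default."""
    if not overwrite and any(name in row for row in rows):
        raise ValueError(f"Column {name!r} already exists (pass overwrite=True)")
    # Build new dicts so the original row objects are never mutated.
    return [{**row, name: value} for row, value in zip(rows, values)]


print(next_ids(["E00499", "E00500"], 2))  # ['E00501', 'E00502']
```

Returning new row dicts instead of editing in place keeps the original data untouched if a later step fails.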