From synthetic-data
Reference card of open-source synthetic data tools, when to use each, install commands, and design patterns.
How this skill is triggered — by the user, by Claude, or both
Slash command
/synthetic-data:tools-referenceThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
A curated reference of open-source and cloud synthetic data tools. Use this skill when the user asks which tool to pick for their use case, or wants to understand the landscape.
A curated reference of open-source and cloud synthetic data tools. Use this skill when the user asks which tool to pick for their use case, or wants to understand the landscape.
What it does: Learns probability distribution from real data, then samples synthetic rows that preserve marginals, correlations, and constraints.
When to use:
Pros: Mature, multi-table support, handles mixed types, row/column constraints, reversible transformations.
Cons: Slower for very high-dimensional data; assumes stationarity.
Key models:
GaussianCopulaSynthesizer — fast, general, works well for continuous/categorical mixCTGANSynthesizer — neural-network-based, high fidelity on mixed types, slower to fitTVAESynthesizer — variational autoencoder, good balance of speed/fidelityHMASynthesizer — multi-table, learns foreign-key relationshipsInstall: pip install sdv
Quality eval: pip install sdmetrics
What it does: Plugin-based generative synthesis with support for differential privacy, fairness constraints, and time-series.
When to use:
Pros: DP guarantees, plugin architecture, time-series module.
Cons: Steeper learning curve; fewer pre-packaged models than SDV.
Install: pip install synthcity
What it does: Differentially-private tabular synthesis using Gaussian copula or uniform sampling.
When to use:
Pros: Formal privacy guarantees, simple API, fast.
Cons: Limited to DP copula approach; less flexible than SDV.
Install: pip install DataSynthesizer
What it does: GAN-based tabular and time-series synthesis.
When to use:
Pros: Strong empirical fidelity; time-series support.
Cons: Training can be unstable; hyperparameter-sensitive.
Install: pip install ydata-synthetic
What it does: Generates realistic fake values for common fields (names, emails, phone numbers, addresses, dates, credit cards, SSNs, IP addresses, etc.) with locale support.
When to use:
Pros: Simple, fast, extremely customizable, 50+ locales.
Cons: No correlation learning; purely random per-field.
Install: pip install faker
Example:
from faker import Faker
fake = Faker('de_DE')
print(fake.name(), fake.email(), fake.phone_number())
What it does: Similar to Faker — generates localized fake data with additional structure (e.g. geography, person relationships).
When to use:
Pros: Compact, fast, good geographic support.
Cons: Smaller ecosystem than Faker.
Install: pip install mimesis
What it does: LLM-based probabilistic time-series forecasting and generation.
When to use:
Pros: Pre-trained, handles multiple series lengths.
Cons: Requires external service or local inference.
Install: Via Hugging Face, huggingface_hub
What it does: Generative adversarial network for time-series synthesis.
When to use:
Pros: Good temporal fidelity.
Cons: Training instability; requires careful hypertuning.
Install: Via research implementations (e.g. GitHub repos)
What it does: Use Claude to generate or transform text records based on prompts describing persona, tone, intent, schema.
When to use:
Pros: Flexible, human-readable output, custom logic via prompts.
Cons: API costs; slower than generative models; requires API key.
Install: pip install anthropic
What it does: Similar to Claude but via OpenAI models.
When to use:
Install: pip install openai
What it does: Client for Gretel's cloud-based synthetic data platform. Uploads data, trains models in the cloud, downloads synthetic data.
When to use:
Pros: Managed service, no local compute, audit trail.
Cons: Costs; data leaves local network.
Install: pip install gretel-client
| Use case | Recommended | Alternatives |
|---|---|---|
| Real tabular → synthetic | SDV (GaussianCopula) | Synthcity, ydata-synthetic |
| Mixed-type tabular, high fidelity | SDV (CTGAN) | ydata-synthetic (GAN), Gretel |
| Privacy-sensitive tabular | Synthcity, DataSynthesizer | SDV + manual DP |
| Multi-table relational | SDV (HMA) | Synthcity with custom |
| Schema-based tabular (fake) | Faker + Mimesis | custom generation |
| PII replacement on real data | Faker + deterministic mapping | ydata-synthetic (PII aware) |
| Synthetic text records | Claude/GPT API | ydata-synthetic (text models) |
| Real text → synthetic (paraphrase) | Claude/GPT API with prompts | fine-tuned LLMs |
| Time-series forecasting | Chronos | SDV (multivariate time) |
| Time-series GAN-based | TimeGAN, ydata-synthetic | SDV (univariate) |
Next: Choose your use case, then refer to the appropriate skill (tabular-from-schema, tabular-from-real, text-records-llm, etc.) for step-by-step execution.
Searches MemPalace before answering questions about past work, people, projects, or prior decisions. Returns verbatim stored content instead of guessing from model memory.
Guides Payload CMS config (payload.config.ts), collections, fields, hooks, access control, APIs. Debugs validation errors, security, relationships, queries, transactions, hook behavior.
Implements vector databases with Pinecone, Weaviate, Qdrant, Milvus, pgvector for semantic search, RAG, recommendations, and similarity systems. Optimizes embeddings, indexing, and hybrid search.
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-data