From qa-test-data-privacy
Author and run Faker libraries (Python `Faker`, JavaScript `@faker-js/faker`, Java `JavaFaker`, .NET `Bogus`) to generate synthetic substitute data when masking pipelines remove real PII. Covers locale-aware generators, deterministic seeding for test reproducibility, the common provider methods (name / email / address / phone / SSN / credit card / IBAN / date / UUID / text), pytest fixture integration, and the trade-off between random and deterministic substitution for referential integrity. Use after a PII detector flags fields that need synthetic replacement. This is distinct from synthetic-pii-generator, which assembles fixtures from scratch; this is the underlying library skill that those build skills compose.
Slash command: `/qa-test-data-privacy:faker-synthetic-data`
Faker is the building block beneath both fresh-fixture generation (synthetic-pii-generator) and PII masking pipelines (pii-masking-pipeline-builder) that need to replace detected PII with a plausible substitute.
Same library family across languages:

- Python: `Faker` (faker.readthedocs.io)
- JavaScript: `@faker-js/faker` (fakerjs.dev)
- Java: JavaFaker (`com.github.javafaker:javafaker`)
- .NET: Bogus (`Bogus` NuGet package)
- Ruby: `faker` gem
- PHP: `fakerphp/faker`

Methodology and provider names are similar across languages; this skill covers Python and JavaScript primarily, as the most widely used.
- After presidio-pii-detection flags PII spans in real data, replace them with Faker output via Presidio's custom operator wrapping a Faker call.
- hypothesis-testing or fast-check).

For complete fresh-fixture generation with PCI-DSS / Luhn / region-format constraints baked in, use synthetic-pii-generator. It is the higher-level skill that composes Faker calls into fixture-bundle workflows.
Per faker.readthedocs.io:

```bash
pip install Faker
```

```python
from faker import Faker

fake = Faker()  # defaults to en_US

# Sample outputs; actual values vary per run unless seeded
print(fake.name())          # "Allison Hill"
print(fake.email())         # e.g. "jsmith@example.org"
print(fake.address())       # "778 Brown Plaza\nSouth Christine, MA..."
print(fake.phone_number())  # "001-543-810-3357x96334"
print(fake.ssn())           # "498-52-4970"
print(fake.credit_card_number(card_type="visa"))  # Luhn-valid
print(fake.iban())          # "GB95...30CG"
print(fake.date_of_birth()) # datetime.date(1962, 1, 17)
print(fake.uuid4())
print(fake.paragraph(nb_sentences=3))
```
A US fixture and a JP fixture need different name distributions, phone formats, and address patterns:

```python
fake_us = Faker("en_US")
fake_jp = Faker("ja_JP")
print(fake_us.name())     # "John Smith"
print(fake_jp.name())     # "山田 太郎"
print(fake_jp.address())  # 北海道札幌市中央区...
```
For datasets with multi-locale users, sample per row:

```python
fake = Faker(["en_US", "ja_JP", "es_ES", "de_DE", "fr_FR"])
for _ in range(10):
    print(fake.name())  # mixed locales
```
For reproducible test fixtures (golden-file comparison, snapshot testing):

```python
Faker.seed(4321)
fake = Faker()
print(fake.name())  # always the same with the same seed + Faker version
```

Per Faker docs: "A Seed produces the same result when the same methods with the same version of faker are called." Pin the Faker version in requirements.txt, because seeded output drifts across versions.
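Faker ships a pytest fixture:

```python
def test_user_creation(faker):
    # User is the application model under test, not part of Faker
    user = User.create(name=faker.name(), email=faker.email())
    assert user.id is not None
```

The `faker` fixture is reseeded before each test. To change the seed, override the `faker_seed` fixture (it is a fixture, not a marker).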
Per fakerjs.dev/guide:

```bash
npm install -D @faker-js/faker
```

```javascript
import { faker } from "@faker-js/faker";

console.log(faker.person.firstName());
console.log(faker.person.lastName());
console.log(faker.internet.email());
console.log(faker.phone.number());
console.log(faker.location.streetAddress());
console.log(faker.location.city());
console.log(faker.location.country());
console.log(faker.finance.creditCardNumber());
console.log(faker.finance.iban());
console.log(faker.string.uuid());
console.log(faker.date.past());
```
Locale-specific import:

```javascript
import { faker as fakerJP } from "@faker-js/faker/locale/ja";
import { faker as fakerDE } from "@faker-js/faker/locale/de";
```
Deterministic seed:

```javascript
faker.seed(123);
console.log(faker.person.firstName()); // always the same
```

Templated strings via `faker.helpers.fake`:

```javascript
const greeting = faker.helpers.fake(
  "Hello {{person.firstName}} {{person.lastName}}!"
);
```

Useful for templated content (notification fixtures, email bodies).
The classic masking-pipeline integration: detect PII with Presidio, replace with Faker. Wrap Faker in a Presidio custom operator:

```python
from faker import Faker
from presidio_anonymizer.entities import OperatorConfig

fake = Faker()
Faker.seed(2026)  # seeds the generator shared by all Faker instances

def fake_person(text, params=None):
    return fake.name()

def fake_email(text, params=None):
    return fake.email()

# Map Presidio entity types to custom operators that call Faker
operators = {
    "PERSON": OperatorConfig("custom", {"lambda": fake_person}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": fake_email}),
    "PHONE_NUMBER": OperatorConfig(
        "custom", {"lambda": lambda text, params=None: fake.phone_number()}
    ),
}
```

This produces locale-coherent replacements (a flagged Spanish email gets a Spanish-style replacement if Faker is locale-configured).
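To make the operator mapping concrete without installing Presidio, here is a library-free sketch of what such a replacement pass does conceptually: walk the detected spans from right to left and replace each one with the output of the callable registered for its entity type. The `Span` tuple, `replace_spans`, the stand-in generators, and the sample text are all hypothetical and are not Presidio's API; in a real pipeline the callables would wrap Faker calls.

```python
from typing import Callable, Dict, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    entity_type: str


def replace_spans(
    text: str, spans: List[Span], operators: Dict[str, Callable[[str], str]]
) -> str:
    """Replace each span with its operator's output; unknown types stay intact."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        op = operators.get(span.entity_type)
        if op is None:
            continue
        text = text[: span.start] + op(text[span.start : span.end]) + text[span.end :]
    return text


# Stand-in generators; a real pipeline would call fake.name() / fake.email().
stub_operators = {
    "PERSON": lambda original: "Jane Doe",
    "EMAIL_ADDRESS": lambda original: "jane.doe@example.org",
}
text = "Contact Bob Ray at bob@corp.test today"
spans = [Span(8, 15, "PERSON"), Span(19, 32, "EMAIL_ADDRESS")]
masked = replace_spans(text, spans, stub_operators)
```

Processing spans in reverse order matters because a replacement rarely has the same length as the text it replaces.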
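If the same email appears across multiple tables, random substitution breaks joins. Derive a deterministic seed from each original value:

```python
import hashlib

def fake_email_deterministic(text, params=None):
    # str hash() is salted per process, so use a digest to keep the seed stable across runs
    Faker.seed(int(hashlib.sha256(text.encode("utf-8")).hexdigest()[:8], 16))
    return Faker().email()
```

Each source address now maps to the same synthetic address at every appearance, as long as the Faker version stays pinned.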
For the broader pseudonymisation discussion see data-masking-techniques-reference on deterministic substitution.
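The join-preservation property can be checked directly. The sketch below is library-free and swaps Faker for a keyed hash (HMAC-SHA256) so that it runs anywhere; the `pseudonymise_email` helper, the key, and the sample tables are all hypothetical. The point is only that a deterministic mapping keeps foreign-key matches intact after masking.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical key; keep real keys out of source control


def pseudonymise_email(original: str) -> str:
    """Map an email to a stable synthetic address on a reserved example domain."""
    digest = hmac.new(SECRET_KEY, original.lower().encode("utf-8"), hashlib.sha256)
    return f"user-{digest.hexdigest()[:12]}@example.org"


users = [{"email": "ann@corp.test", "plan": "pro"}]
orders = [
    {"email": "ann@corp.test", "total": 30},
    {"email": "bo@corp.test", "total": 5},
]

masked_users = [{**u, "email": pseudonymise_email(u["email"])} for u in users]
masked_orders = [{**o, "email": pseudonymise_email(o["email"])} for o in orders]

# The join on email still matches exactly one order after masking.
known = {u["email"] for u in masked_users}
joined = [o for o in masked_orders if o["email"] in known]
```

The output is less realistic than a Faker-generated address, which is the trade-off for not reseeding a shared generator on every call.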
Faker output is plain strings (or library-specific types like datetime.date, decimal.Decimal). Validate per downstream contract:

```python
import re
from faker import Faker

fake = Faker()

email = fake.email()
assert re.fullmatch(r"[^@]+@[^@]+\.[^@]+", email)

card = fake.credit_card_number()

# Faker generates Luhn-valid numbers; verify if downstream requires
def luhn(n):
    # Double every second digit from the right and sum the digits of each product
    digits = [int(d) for d in n if d.isdigit()][::-1]
    total = sum(d if i % 2 == 0 else sum(divmod(d * 2, 10)) for i, d in enumerate(digits))
    return total % 10 == 0

assert luhn(card)
```
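The same approach extends to other checksummed formats. As an illustration, this is a small sketch of the ISO 13616 mod-97 check for IBANs, which could be applied to `fake.iban()` output if a downstream system validates checksums. The `iban_is_valid` helper is hypothetical, and the sample value is the widely published GB example IBAN, not Faker output.

```python
def iban_is_valid(iban: str) -> bool:
    """Return True when the IBAN passes the mod-97 checksum."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move country code + check digits to the end, then map A..Z to 10..35.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


ok = iban_is_valid("GB82 WEST 1234 5698 7654 32")
bad = iban_is_valid("GB83 WEST 1234 5698 7654 32")  # check digits altered
```

This checks only the checksum; country-specific length and structure rules are out of scope for the sketch.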
For projects that maintain fixture sets, regenerate on every CI run with a pinned seed so fixtures stay deterministic across runs but change when explicitly requested:
- run: python -m faker --seed 42 -r 100 -- 'name,email,phone_number' > fixtures.csv
Faker's CLI (python -m faker) supports CSV / JSON / YAML output.
```python
import csv
from faker import Faker

Faker.seed(2026)
fake = Faker(["en_US", "es_ES", "ja_JP"])

# newline="" is what the csv module expects; utf-8 keeps non-Latin names intact
with open("users.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "email", "phone", "country", "dob"])
    for i in range(1000):
        writer.writerow([
            i,
            fake.name(),
            fake.email(),
            fake.phone_number(),
            fake.country(),
            fake.date_of_birth(minimum_age=18, maximum_age=80),
        ])
```

1000 synthetic users, locale-mixed, deterministic given the seed.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Unseeded fakes in a test | Test passes today, fails tomorrow (different fake values) | Faker.seed(N) per test or use the pytest fixture |
| Random substitution where referential integrity matters | Joins break across masked tables | Deterministic seed per source value (see Running section) |
| Single locale on a multi-locale dataset | Spanish emails get US replacements; layout / format drift in fixtures | Pass list of locales to Faker([...]) |
| Faker.credit_card without Luhn awareness | Faker IS Luhn-valid; over-validation is wasted work | Trust Faker for cards; validate other formats |
| Using Faker output as "real" test card | Faker cards are Luhn-valid but not Stripe / Adyen test cards | Use synthetic-pii-generator for PCI-DSS-safe test cards (Stripe / Visa reserved ranges) |
| Unpinned Faker version in CI | Output drifts on upgrade; snapshot diffs break unexpectedly | Pin faker==X.Y.Z in lockfile |
| Using fake.text() for malicious-input testing | Faker text is benign; doesn't cover XSS / SQLi payloads | Use malicious-payload-bank |
- synthea-healthcare-data (which simulates patient lifecycles) or a domain-specific generator.
- synthetic-pii-generator.
- com.github.javafaker:javafaker on Maven Central.
- Bogus NuGet package.
- synthetic-pii-generator.
- presidio-pii-detection, pii-masking-pipeline-builder.

npx claudepluginhub testland/qa --plugin qa-test-data-privacy