From synthetic-data
Anonymise a real dataset by replacing PII columns with realistic Faker values, preserving structure and referential integrity.
Slash command
/synthetic-data:replace-pii
Anonymise a real dataset in place by replacing specified PII columns (name, email, phone, address, SSN, IP, DoB) with realistic fake values, while preserving non-PII columns and row count. Uses deterministic mapping to ensure repeated values (e.g., the same customer name appearing 5 times) map to the same fake name.
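The deterministic mapping described above can be illustrated with the standard library alone. This is a minimal sketch rather than the plugin's own code: `seed_for` and `stand_in_fake` are hypothetical helper names, and `random.Random` stands in for a seeded Faker instance so the example runs without extra dependencies.

```python
import hashlib
import random


def seed_for(value):
    """Derive a stable 31-bit seed from a value (hypothetical helper)."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % (2**31)


def stand_in_fake(value):
    """Stand-in for a seeded Faker call: the same input gives the same output."""
    rng = random.Random(seed_for(value))
    return f"user{rng.randrange(10**6):06d}"


names = ["Alice Smith", "Bob Jones", "Alice Smith"]
fakes = [stand_in_fake(n) for n in names]
print(fakes)  # the first and third entries are identical
```

Because the seed depends only on the value itself, the same real value would also map to the same replacement in a second file or a later run, which is what keeps repeated values consistent.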
PII column mapping (e.g., {"name": "name", "email": "email", "phone": "phone_number"})
Faker locale (default en_US; e.g., de_DE, fr_FR, ja_JP)
Output location (/synthetic-data-workspace/outputs/)
Install Faker:
pip install faker
Write a PII replacement script:
import hashlib

import pandas as pd
from faker import Faker


def anonymise_pii(input_path, output_path, pii_columns, locale='en_US'):
    """
    Replace PII columns with Faker-generated values.

    pii_columns: dict of {col_name: faker_provider}
    E.g. {'name': 'name', 'email': 'email', 'phone': 'phone_number',
          'address': 'address', 'ssn': 'ssn', 'dob': 'date_of_birth'}
    """
    fake = Faker(locale)
    df = pd.read_csv(input_path)

    # Deterministic mapping: hash(real_value) → seed → fake_value
    # Ensures same real PII → same fake PII across rows
    mapping = {}
    for col_name, provider in pii_columns.items():
        if col_name not in df.columns:
            print(f"Warning: column {col_name} not found, skipping")
            continue

        # Build mapping for this column
        col_mapping = {}
        for real_value in df[col_name].dropna().unique():
            # Deterministic seed from an MD5 hash of the value's string form
            real_value_str = str(real_value)
            hash_int = int(hashlib.md5(real_value_str.encode()).hexdigest(), 16)
            seed = hash_int % (2**31)

            # Reseed this instance only; Faker.seed() would also change global state
            fake.seed_instance(seed)

            # Get the fake value from the named provider
            col_mapping[real_value] = getattr(fake, provider)()

        # Apply mapping to column (missing values stay missing)
        df[col_name] = df[col_name].map(col_mapping)
        mapping[col_name] = col_mapping
        print(f"Replaced {col_name}: {len(col_mapping)} unique values")

    # Save anonymised data
    df.to_csv(output_path, index=False)
    print(f"Anonymised data saved to {output_path}")
    print(df.head())
    return df, mapping


if __name__ == '__main__':
    pii_cols = {
        'customer_name': 'name',
        'email': 'email',
        'phone': 'phone_number',
        'address': 'address'
    }
    anonymise_pii('real_data.csv', 'anonymised_data.csv', pii_cols, locale='de_DE')
Run the script:
python replace_pii.py
Verify output:
head -5 anonymised_data.csv
# Check that PII columns are replaced and non-PII columns unchanged
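The manual check above could also be scripted. The following is a sketch under stated assumptions, not part of the plugin: `compare_datasets` is a hypothetical helper that uses only the standard `csv` module, compares cells as raw strings, and assumes both files have the same row order.

```python
import csv


def compare_datasets(original_path, anonymised_path, pii_columns):
    """Return a list of problems found; an empty list means the checks passed."""
    with open(original_path, newline="") as f:
        original = list(csv.DictReader(f))
    with open(anonymised_path, newline="") as f:
        anonymised = list(csv.DictReader(f))

    problems = []
    if len(original) != len(anonymised):
        problems.append(f"row count changed: {len(original)} -> {len(anonymised)}")
        return problems

    for i, (before, after) in enumerate(zip(original, anonymised)):
        for col, value in before.items():
            if col in pii_columns:
                # Non-empty PII should no longer appear verbatim
                if value and after.get(col) == value:
                    problems.append(f"row {i}: {col} was not replaced")
            elif after.get(col) != value:
                problems.append(f"row {i}: non-PII column {col} changed")
    return problems
```

A call such as `compare_datasets('real_data.csv', 'anonymised_data.csv', {'customer_name', 'email', 'phone', 'address'})` would return an empty list when everything lines up. Since the comparison is textual, a pandas round trip that rewrites a non-PII value (for example, an integer column with missing entries written back as floats) would be reported as a change.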
Optional: Check referential integrity:
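# Verify that the same original PII value always maps to the same fake
import pandas as pd

pii_cols = ['customer_name', 'email', 'phone', 'address']
original_df = pd.read_csv('real_data.csv')
anon_df = pd.read_csv('anonymised_data.csv')
for col in pii_cols:
    # Count distinct fake values per real value; each count should be 1.
    # The deterministic seed guarantees this, so it is only a spot-check.
    fakes_per_real = anon_df[col].groupby(original_df[col]).nunique()
    print(f"{col}: all rows with same real value mapped to same fake? {(fakes_per_real <= 1).all()}")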
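Deterministic seeding ensures that one real value always yields the same fake, but it does not by itself rule out two different real values landing on the same fake. If a PII column is used as a join key, such a collision would merge two entities. A minimal sketch of a check for this, assuming the per-column dict returned by the script as `mapping[col_name]`; `find_collisions` is a hypothetical helper name:

```python
from collections import defaultdict


def find_collisions(col_mapping):
    """Return {fake_value: [real_values]} for fakes shared by several real values."""
    reals_by_fake = defaultdict(list)
    for real, fake in col_mapping.items():
        reals_by_fake[fake].append(real)
    return {fake: reals for fake, reals in reals_by_fake.items() if len(reals) > 1}


# Illustrative data only, not output from a real run
example = {
    "alice@example.com": "a@fake.test",
    "bob@example.com": "b@fake.test",
    "carol@example.com": "a@fake.test",
}
print(find_collisions(example))
# → {'a@fake.test': ['alice@example.com', 'carol@example.com']}
```

An empty result means the mapping for that column is one-to-one.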
npx claudepluginhub danielrosehill/claude-code-plugins --plugin synthetic-data