From qa-test-data-privacy
Build-an-X workflow that produces a PII masking pipeline spec from a source-data inventory. Walks the author through (1) classifying each field against pii-categories-reference, (2) picking a masking operator from data-masking-techniques-reference, (3) deciding pseudonymisation (reversible, in GDPR scope) vs anonymisation (irreversible, out of scope), (4) ordering the pipeline (detect → operator → audit), and (5) emitting a deployable config for Presidio + Faker + Synthea wrappers. Output is a YAML pipeline spec plus a per-field rationale table. Use after classifying a dataset's PII risk; this is the workflow that translates classification into runnable masking config.
Slash command
/qa-test-data-privacy:pii-masking-pipeline-builder
Authoring a masking pipeline requires three classifications per field (regulatory regime, operator, reversibility) and one global decision (pipeline ordering + audit hooks). This workflow produces a deployable YAML spec that downstream tools execute:
- presidio-pii-detection runs the detector.
- faker-synthetic-data / synthea-healthcare-data supply substitute values.
- pii-leak-critic audits the output.

Enumerate every column / field in the source dataset. For each, record:
| Column | Type | Sample value | Cardinality | Cross-table join? |
|---|---|---|---|---|
| users.email | string | user@example.com | high | yes (joins events) |
| users.ssn | string | 123-45-6789 | high | no |
| users.dob | date | 1985-03-14 | medium | no |
| users.zip | string | 02139 | low | no |
| users.country | string | US | very low | no |
A schema introspector can produce the first columns; cardinality and join graph need a quick analytical pass.
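The analytical pass for cardinality can be sketched in a few lines. This is a minimal illustration, assuming the rows are already available as a list of dicts (for example a sample pulled through the read-only connection); the function name `cardinality_profile` is hypothetical, and mapping the ratio onto labels such as "low" or "high" is left to the author.

```python
from collections import defaultdict


def cardinality_profile(rows):
    """Return {column: (distinct_count, distinct_ratio)} for a list of dict rows."""
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            distinct[column].add(value)
    total = len(rows)
    # An empty input yields an empty profile, so no division by zero occurs.
    return {
        column: (len(values), len(values) / total)
        for column, values in distinct.items()
    }


sample = [
    {"email": "a@x.test", "country": "US"},
    {"email": "b@x.test", "country": "US"},
]
print(cardinality_profile(sample))
```

A ratio near 1.0 suggests a unique or near-unique column, while a ratio near 0 suggests a small set of repeated values. On a sample the ratio is only an estimate of the full table's cardinality.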
Look up each column in pii-categories-reference and record which regulatory regime(s) apply. Include linkable fields explicitly (NIST SP 800-122 §2.2).
| Column | GDPR | CPRA SPI | NIST | HIPAA | Risk |
|---|---|---|---|---|---|
| users.email | ✓ | - | ✓ | ✓ #6 | direct |
| users.ssn | ✓ | ✓ | ✓ | ✓ #7 | direct, high-sensitivity |
| users.dob | linkable | - | linkable | ✓ #3 | linkable |
| users.zip | linkable | - | linkable | ✓ #2 (sub-state) | linkable |
| users.country | - | - | - | - | non-PII |
Any field marked direct OR linkable enters the masking scope. A field marked only "linkable" still gets masked because it can identify a person in combination with other fields (Sweeney 87% rule, see pii-categories-reference).
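The scope rule can be expressed as a small filter over the Risk column. This is a sketch only: the risk strings are copied from the table above, while the names `in_masking_scope` and `masking_scope` are hypothetical.

```python
RISK_BY_COLUMN = {
    "users.email": "direct",
    "users.ssn": "direct, high-sensitivity",
    "users.dob": "linkable",
    "users.zip": "linkable",
    "users.country": "non-PII",
}


def in_masking_scope(risk):
    """A field is in scope if its risk label includes 'direct' or 'linkable'."""
    labels = {part.strip() for part in risk.lower().split(",")}
    return bool(labels & {"direct", "linkable"})


def masking_scope(risk_by_column):
    """Return the in-scope columns, preserving inventory order."""
    return [col for col, risk in risk_by_column.items() if in_masking_scope(risk)]


print(masking_scope(RISK_BY_COLUMN))
```

Only `users.country` falls outside the scope here, which matches its later pass-through treatment.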
Match each field to a technique in data-masking-techniques-reference. Decision tree:
| Column | Operator | Rationale | Reversible? |
|---|---|---|---|
| users.email | Faker substitution (deterministic via hash-seed) | Joins across tables; need referential integrity | Yes (via salt vault) |
| users.ssn | Tokenisation (vault) | Strict regulator scope; round-trip needed for auth | Yes (via vault) |
| users.dob | Generalisation to year | Analytics needs age bracket, not exact DOB | No |
| users.zip | Truncation to first 3 digits | HIPAA Safe Harbor #2 rule (>20k pop only) | No |
| users.country | Pass-through | Not PII | n/a |
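Three of the operators in the table are simple enough to sketch with the standard library. The code below is an illustration under stated assumptions, not the referenced skills' implementation: the deterministic substitution uses an HMAC-derived address as a stand-in for a hash-seeded Faker call, the salt would come from the vault in practice, and `low_population_prefixes` is a hypothetical caller-supplied set of 3-digit ZIP prefixes whose area holds 20,000 or fewer people (Safe Harbor requires those to become `000`).

```python
import hashlib
import hmac
from datetime import date


def pseudonymous_email(value, salt):
    """Deterministic substitution: the same input and salt always map to the
    same substitute address, so joins across tables keep working."""
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(salt, normalised, hashlib.sha256).hexdigest()
    return f"user_{digest[:16]}@example.invalid"


def generalise_to_year(dob):
    """Generalisation: keep only the year of a date of birth."""
    return dob.year


def truncate_zip(zip_code, low_population_prefixes=frozenset()):
    """Truncation: keep the first 3 digits unless the prefix is low-population."""
    prefix = zip_code[:3]
    return "000" if prefix in low_population_prefixes else prefix


salt = b"demo-salt-not-for-production"  # assumption: real salts live in the vault
print(pseudonymous_email("User@Example.com", salt))
print(generalise_to_year(date(1985, 3, 14)))
print(truncate_zip("02139"))
```

Normalising the e-mail before hashing is a design choice: it keeps `User@Example.com` and `user@example.com` mapped to the same substitute, which matters when different tables store the address in different cases.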
For each masked field, mark whether the result remains personal data under GDPR Art. 4(5):
Document the gate decision per dataset:
```yaml
output_classification: pseudonymised   # GDPR scope retained
gdpr_lawful_basis: Article 6(1)(f) legitimate interests
retention: 90 days
access_control: only-dev-environment-team
```

vs.

```yaml
output_classification: anonymised
gdpr_lawful_basis: out-of-scope per Recital 26
retention: indefinite
access_control: open
```
The author cannot claim "anonymised" if any reversible technique is in the pipeline.
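That rule can be enforced mechanically before the spec is emitted. The sketch below assumes the operator names used in this document's spec examples and treats deterministic substitution, tokenisation, and encryption as reversible, following the rationale tables; the set and the function name `may_claim_anonymised` are illustrative assumptions.

```python
# Assumption: operators this document treats as reversible or re-linkable.
REVERSIBLE_OPERATORS = frozenset(
    {"deterministic_substitution", "tokenisation", "encryption"}
)


def may_claim_anonymised(operators):
    """Return False as soon as any reversible operator is in the pipeline.

    True is only a necessary condition: it says no reversible operator was
    found, not that the output resists re-identification."""
    return not any(op in REVERSIBLE_OPERATORS for op in operators)


print(may_claim_anonymised(["generalisation", "truncation", "passthrough"]))
print(may_claim_anonymised(["tokenisation", "generalisation"]))
```

A pipeline that passes this check still needs the linkability review from the classification step before "anonymised" is a defensible label.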
A standard order:
1. Detect: run presidio-pii-detection to catch embedded PII (e.g., a user-typed comment that contains an email).
2. Operator: apply each field's masking operator.
3. Audit: run pii-leak-critic before declaring the run complete.

Recommended shape - consumable by a generic pipeline runner:
```yaml
pipeline:
  name: users-staging-refresh
  source:
    type: postgres
    connection: $PROD_RO_DSN
    schema: public
    table: users
  classification:
    output: pseudonymised
    regimes: [gdpr, cpra, hipaa]
  fields:
    - column: email
      operator: deterministic_substitution
      provider: faker
      provider_method: internet.email
      seed_strategy: hash(salt + value)
      salt_ref: vault://masking/users.email
    - column: ssn
      operator: tokenisation
      vault: vault://masking/users.ssn
    - column: dob
      operator: generalisation
      params:
        granularity: year
    - column: zip
      operator: truncation
      params:
        keep_chars: 3
        from: start
    - column: country
      operator: passthrough
  free_text_columns:
    - notes
    - support_message
  free_text_detector:
    type: presidio
    language: en
    score_threshold: 0.45
    entities: [PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, IP_ADDRESS]
    on_detect: replace
  audit:
    sample_rows: 100
    fail_on_critic_block: true
  output:
    type: postgres
    connection: $STAGING_RW_DSN
    schema: public
    table: users
  manifest:
    write_to: s3://masking-manifests/${run_id}.json
```
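To make the detect → operator → audit ordering concrete, here is a minimal runner sketch over in-memory rows. It is not the generic pipeline runner the spec targets: a single e-mail regex stands in for the Presidio detector, a check that no original value survives stands in for pii-leak-critic, and every function name is hypothetical.

```python
import re

# Stand-in detector: one e-mail regex instead of a Presidio recogniser set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_free_text(text):
    """Replace every detected e-mail address with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL_ADDRESS>", text)


def audit_sample(original_rows, masked_rows, sensitive_columns, sample_rows):
    """Return True if no original sensitive value survives in the sampled output."""
    pairs = list(zip(original_rows, masked_rows))[:sample_rows]
    for original, masked in pairs:
        for column in sensitive_columns:
            secret = original.get(column)
            if secret in (None, ""):
                continue
            if any(str(secret) in str(value) for value in masked.values()):
                return False
    return True


def run_pipeline(rows, field_operators, free_text_columns, sample_rows=100):
    """Apply detect, then per-field operators, then audit; raise on a leak."""
    masked_rows = []
    for row in rows:
        out = dict(row)  # copy so the source rows are never mutated
        for column in free_text_columns:  # 1. detect embedded PII
            if isinstance(out.get(column), str):
                out[column] = mask_free_text(out[column])
        for column, operator in field_operators.items():  # 2. operator
            if column in out:
                out[column] = operator(out[column])
        masked_rows.append(out)
    # 3. audit before declaring the run complete
    if not audit_sample(rows, masked_rows, field_operators, sample_rows):
        raise RuntimeError("audit failed: original value found in masked output")
    return masked_rows


rows = [{"email": "a@b.com", "notes": "mail me at a@b.com", "country": "US"}]
print(run_pipeline(rows, {"email": lambda value: "tok_1"}, ["notes"]))
```

Calling the same function with an empty `free_text_columns` list raises, because the address typed into `notes` survives into the output; that is the failure the audit step exists to catch.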
A SaaS app refreshes its staging from prod nightly. Source has 4M users with 22 columns, 3 of which are free-text. Synthesised spec:
```yaml
pipeline:
  name: prod-to-staging-nightly
  source: { type: postgres, table: users }
  classification: { output: pseudonymised, regimes: [gdpr, cpra] }
  fields:
    - { column: user_id, operator: passthrough }  # internal opaque ID
    - { column: email, operator: deterministic_substitution,
        provider: faker, provider_method: internet.email,
        seed_strategy: hash(salt + value), salt_ref: vault://prod/email }
    - { column: full_name, operator: substitution,
        provider: faker, provider_method: name }
    - { column: phone, operator: substitution,
        provider: faker, provider_method: phone_number }
    - { column: address_line1, operator: substitution,
        provider: faker, provider_method: address }
    - { column: country, operator: passthrough }
    - { column: language, operator: passthrough }
    - { column: created_at, operator: passthrough }
    - { column: last_login_at, operator: passthrough }
    - { column: signup_ip, operator: encryption,
        params: { algo: fpe-ff1 }, key_ref: vault://prod/ip-fpe }
    - { column: notes, operator: free_text_mask }
  free_text_detector:
    type: presidio
    language: en
    score_threshold: 0.5
    on_detect: replace
  audit: { sample_rows: 100, fail_on_critic_block: true }
```
Pipeline classification: pseudonymised (email is deterministic, IP is FPE-encrypted with key retained). The user explicitly accepts that this output remains in GDPR scope.
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Per-column operator without referential check | Joins break after masking | Group columns that share keys; apply deterministic operators consistently |
| Free-text columns skipped | Embedded PII (user-typed emails) leaks | Always run Presidio on any string column > ~50 chars |
| Claiming "anonymised" when any reversible op is in the pipeline | False GDPR compliance claim | Audit the pipeline; pseudonymised if any operator is reversible |
| No audit step | Operator failure or recogniser drift goes unnoticed | Always sample output and run pii-leak-critic |
| Salt vault key shared across pipelines | Salt-rotation breaks every downstream pipeline at once | Per-pipeline salt; rotate independently |
| No manifest | Cannot reproduce a past run; auditors can't trace lineage | Always emit manifest with version IDs |
| Pipeline runs on prod-write connection | Risk of writing masked data back over prod | Strict source = read-only DSN; output = staging-write DSN |
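Several of these anti-patterns can be caught by linting the spec before it runs. The sketch below assumes the YAML has already been parsed into a dict with everything nested under a top-level `pipeline` key; `lint_spec` is a hypothetical helper, and it covers only the checks that are visible inside a single spec (cross-pipeline salt sharing is not).

```python
def lint_spec(spec):
    """Return a list of anti-pattern findings for a parsed pipeline spec."""
    pipeline = spec.get("pipeline", {})
    findings = []

    if not pipeline.get("audit"):
        findings.append("no audit step")
    if not pipeline.get("manifest"):
        findings.append("no manifest")
    if pipeline.get("free_text_columns") and not pipeline.get("free_text_detector"):
        findings.append("free-text columns have no detector")

    # Source and output must not point at the same connection.
    source_dsn = pipeline.get("source", {}).get("connection")
    output_dsn = pipeline.get("output", {}).get("connection")
    if source_dsn is not None and source_dsn == output_dsn:
        findings.append("source and output share a connection")

    return findings


risky = {
    "pipeline": {
        "source": {"connection": "$DSN"},
        "output": {"connection": "$DSN"},
        "free_text_columns": ["notes"],
    }
}
for finding in lint_spec(risky):
    print(finding)
```

An empty list means none of these four checks fired; it says nothing about referential integrity or classification, which need the per-field review.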
- presidio-pii-detection limitations).
- pii-categories-reference, data-masking-techniques-reference, presidio-pii-detection, faker-synthetic-data, synthea-healthcare-data.
- pii-leak-critic.

npx claudepluginhub testland/qa --plugin qa-test-data-privacy