From qa-test-data-privacy
Pure-reference catalog of data-masking techniques and de-identification privacy models. Enumerates the seven canonical masking operators (substitution, shuffling, number/date variance, encryption, hashing, nulling, masking-out / character-scrambling) plus tokenisation, redaction, format-preserving encryption, and Microsoft Presidio's six built-in operators. Distinguishes reversible techniques (pseudonymisation candidates per GDPR Art. 4(5)) from irreversible techniques (anonymisation candidates). Maps techniques to NIST SP 800-188 privacy models - k-anonymity, l-diversity, t-closeness, differential privacy. Cites ISO/IEC 20889:2018 for the standard taxonomy. Use to pick the right masking operator per field type and risk level.
Slash command
/qa-test-data-privacy:data-masking-techniques-reference
Masking is the act of transforming a real value into a substitute that breaks the link to the original subject while preserving testable properties (format, distribution, referential integrity). Which technique is correct depends on three things: whether the result must be reversible, whether the field is referentially shared across tables, and what privacy model the dataset must satisfy.
This skill is the pure reference that the pipeline builder (pii-masking-pipeline-builder) and the leak critic (pii-leak-critic) draw from to choose operators per field.
pii-categories-reference
classified as PII).
Drawing from the Wikipedia data-masking taxonomy (en.wikipedia.org/wiki/Data_masking) and ISO/IEC 20889:2018 (cite by stable ID; standard text behind paywall):
1. Substitution
Replace the real value with an authentic-looking value from a lookup table - "John Smith" → "Maria Garcia".
hash(real_id) → fake_id keeps joins intact across tables).
replace operator (microsoft.github.io/presidio/anonymizer), Faker library generators (faker-synthetic-data).
2. Shuffling
Randomly rearrange values within a column - the salaries column gets shuffled, so each row keeps a real salary but no longer the right person's salary.
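The two operators above can be sketched with the standard library alone. This is a minimal illustration, not the method of any tool cited here: the lookup table `FAKE_NAMES`, the helper names `substitute` and `shuffle_column`, and the demo key are all hypothetical. Substitution is made deterministic by deriving the table index from a keyed hash of the real value, which is one possible way to keep joins intact across tables.

```python
import hashlib
import hmac
import random

# Hypothetical lookup table of replacement names (illustrative only).
FAKE_NAMES = ["Maria Garcia", "Wei Chen", "Amara Okafor", "Lars Nilsson"]


def substitute(value: str, key: bytes, table=FAKE_NAMES) -> str:
    """Deterministic substitution: the same input and key give the same fake value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(table)
    return table[index]


def shuffle_column(values, seed=None):
    """Return the column's values in random order; the set of values is unchanged."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


key = b"demo-key-keep-secret"  # assumption: a secret held outside the masked dataset
customers = ["John Smith", "Jane Doe", "John Smith"]
masked = [substitute(name, key) for name in customers]
# Rows 0 and 2 held the same real name, so they receive the same substitute.

salaries = [52000, 61000, 48000, 75000]
shuffled_salaries = shuffle_column(salaries, seed=7)
```

With a table this small, different real names can collide on the same substitute; a production lookup table would need to be much larger than the number of distinct inputs, or the mapping stored explicitly.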
3. Number / date variance
Apply a bounded random offset: salary ± 10 %, dates ± 120 days (Wikipedia data-masking page).
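A minimal sketch of bounded variance, assuming a uniform offset (the source does not specify a distribution). The helper names `vary_number` and `vary_date` are hypothetical; the ± 10 % and ± 120 day bounds are the ones quoted above.

```python
import random
from datetime import date, timedelta


def vary_number(value: float, pct: float, rng: random.Random) -> float:
    """Apply a uniform random offset within ±pct of the original value."""
    return value * (1 + rng.uniform(-pct, pct))


def vary_date(value: date, max_days: int, rng: random.Random) -> date:
    """Shift a date by a uniform random number of days within ±max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random(42)  # seeded only to make the demo repeatable
salary = 60000.0
masked_salary = vary_number(salary, 0.10, rng)

hired = date(2020, 6, 15)
masked_hired = vary_date(hired, 120, rng)
```

Seeding is for repeatability in tests; a real masking run would normally use an unseeded generator so that offsets cannot be replayed.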
4. Encryption
Apply a cryptographic algorithm with a key. Two sub-variants:
General encryption (AES-256-GCM, etc.) - output is opaque ciphertext; reversible only with the key. Use for fields that must round-trip back to plaintext for authorised consumers.
Format-preserving encryption (FPE) (FF1 / FF3 per NIST SP 800-38G) - output has the same format as input (16-digit card → 16-digit ciphertext). Use when legacy systems validate format.
Reversibility: Reversible (key required).
Use for: PII that must round-trip for authorised business logic; legacy-format requirements.
5. Hashing
Apply a one-way hash (SHA-256 / SHA-512) with optional salt.
hash operator with hash_type = "sha256" or "sha512" and salt parameter.
6. Nulling
Replace the value with NULL or remove the column entirely.
7. Masking-out / character-scrambling
Show partial value - credit card "**** **** **** 1234", email "j***@example.com".
mask operator with chars_to_mask, masking_char, from_end parameters.
Tokenisation
Replace the real value with a token (random opaque string) and store the real-value → token map in a separate, access-controlled vault.
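Masking-out and tokenisation can both be illustrated in a few lines. This is a stand-alone sketch, not Presidio's implementation: `mask_out` merely borrows the parameter names listed above, and `TokenVault` is a hypothetical in-memory stand-in for what would in practice be a separate, access-controlled store.

```python
import secrets


def mask_out(value: str, chars_to_mask: int, masking_char: str = "*",
             from_end: bool = False) -> str:
    """Replace chars_to_mask characters, from the start or from the end."""
    n = max(0, min(chars_to_mask, len(value)))
    if from_end:
        return value[: len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


class TokenVault:
    """In-memory token vault (illustrative; no access control or persistence)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenise(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value not in self._token_by_value:
            token = secrets.token_hex(8)  # random, carries no information about value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenise(self, token: str) -> str:
        return self._value_by_token[token]


card = "4111111111111234"
masked_card = mask_out(card, 12)  # masks the first 12 characters

vault = TokenVault()
token = vault.tokenise(card)
```

The token is reversible only through the vault, which is why the table further down treats tokenisation with a retained vault as pseudonymisation.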
Redaction
Remove the value entirely (no placeholder, no length signal).
redact operator (no parameters).
Synthetic substitution
Replace with a synthetically generated value preserving distribution / format (faker-synthetic-data; synthea-healthcare-data for health records).
Per microsoft.github.io/presidio/anonymizer, the Presidio Anonymizer engine supports six built-in operators:
| Operator | Parameters | Reversible | Maps to canonical technique |
|---|---|---|---|
| replace | new_value (defaults to <entity_type>) | No (random) / Yes (deterministic substitution) | #1 Substitution |
| redact | - | No | Redaction |
| mask | chars_to_mask, masking_char, from_end | No | #7 Masking-out |
| hash | hash_type (sha256 / sha512), salt | No (one-way) | #5 Hashing |
| encrypt | key | Yes (with key) | #4 Encryption |
| custom | lambda | Depends on lambda | (caller-defined) |
Invocation: engine.anonymize(text=, analyzer_results=, operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})}).
OperatorConfig constructor signature: OperatorConfig(operator_name, params={}) (Presidio docs).
GDPR Art. 4(5) defines pseudonymisation as "processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately" (gdpr-info.eu/art-4-gdpr/).
| Technique | Pseudonymisation? | Anonymisation? |
|---|---|---|
| Deterministic substitution (same input → same output) | ✓ | - |
| Random substitution | - | ✓ |
| Shuffling | - | ✓ (when distribution-only) |
| Number / date variance | - | ✓ if variance ≥ identifying granularity |
| General encryption (key kept) | ✓ | - |
| FPE (key kept) | ✓ | - |
| Salted hashing (salt kept separately) | ✓ | - |
| Unsalted hashing of low-entropy field | ✗ (re-identifiable by enumeration) | ✗ |
| Nulling | - | ✓ |
| Masking-out (partial) | depends on revealed chars | depends |
| Tokenisation (vault kept) | ✓ | - |
| Tokenisation + vault destroyed | - | ✓ |
| Redaction | - | ✓ |
| Synthetic substitution | - | ✓ |
Implication: A "masking pipeline" output that uses reversible techniques is still personal data under GDPR - it remains in scope. Only fully irreversible output is out of GDPR scope per Recital 26.
NIST SP 800-188:2023 ("De-Identifying Government Datasets", csrc.nist.gov/pubs/sp/800/188/final) formalises the privacy models layered above the techniques above: k-anonymity, l-diversity, t-closeness, and differential privacy.
k-anonymity
A dataset is k-anonymous if every record is indistinguishable from at least k − 1 other records when projected on the quasi-identifiers (Sweeney 2002, cited in NIST 800-188).
Typical values are k = 5, k = 10, or k = 100, depending on dataset size and risk tolerance.
l-diversity
Strengthens k-anonymity by requiring at least l well-represented values of the sensitive attribute within each equivalence class (Machanavajjhala et al. 2007).
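The two definitions above translate into a short check. This sketch groups rows by their quasi-identifier values and reports the smallest class size (k) and the smallest number of distinct sensitive values per class. Counting distinct values is the simplest reading of "well-represented" and is an assumption here; the column names and sample rows are invented for illustration.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_identifiers)].append(row)
    return classes


def k_of(rows, quasi_identifiers):
    """Largest k for which the rows are k-anonymous (smallest class size)."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in classes.values())


def l_of(rows, quasi_identifiers, sensitive):
    """Distinct l-diversity: fewest distinct sensitive values in any class."""
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())


# Invented, already-generalised sample data.
rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "021**", "diagnosis": "diabetes"},
]
k = k_of(rows, ["age", "zip"])
l = l_of(rows, ["age", "zip"], "diagnosis")
```

The sample is 2-anonymous but only 1-diverse: everyone in the 40-49 class shares one diagnosis, so knowing a person's class reveals the sensitive value even though no single record is singled out.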
t-closeness
Strengthens l-diversity by requiring that the distribution of the sensitive attribute in each equivalence class be close (within t, by Earth Mover's Distance) to its distribution in the overall dataset (Li et al. 2007).
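A rough sketch of a t-closeness check for a categorical sensitive attribute. It assumes the equal-ground-distance case, in which Earth Mover's Distance reduces to half the sum of absolute probability differences; ordered or hierarchical attributes need a different ground distance. Function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Empirical probability of each value in a list."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}


def equal_distance_emd(p, q):
    """EMD between two categorical distributions under equal ground distance."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def t_of(rows, quasi_identifiers, sensitive):
    """Smallest t for which the rows satisfy t-closeness (worst class distance)."""
    overall = distribution([r[sensitive] for r in rows])
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(equal_distance_emd(distribution(vals), overall)
               for vals in classes.values())


# Invented, already-generalised sample data.
rows = [
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "asthma"},
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "40-49", "diagnosis": "diabetes"},
    {"age": "40-49", "diagnosis": "diabetes"},
]
t = t_of(rows, ["age"], "diagnosis")
```

A dataset passes a t-closeness requirement when the value returned by `t_of` is at most the chosen threshold t.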
Differential privacy
A formal mathematical guarantee: the probability of any output changes by at most a multiplicative factor (e^ε) when a single record is added or removed. ε (epsilon) is the privacy budget - lower ε = stronger privacy.
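One common way to meet this guarantee for a counting query is the Laplace mechanism, sketched below. The source does not name a mechanism, so this choice is an assumption. Adding or removing one record changes a count by at most 1 (sensitivity 1), so noise drawn from a Laplace distribution with scale 1/ε suffices. The helper names and sample records are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of matching records; a count has sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the demo repeatable
ages = [23, 35, 41, 29, 52, 38, 61, 33]
noisy_over_30 = dp_count(ages, lambda a: a > 30, epsilon=1.0, rng=rng)
```

Halving ε doubles the noise scale, which is the concrete meaning of "lower ε = stronger privacy". Each released query also spends budget, so repeated queries on the same data weaken the overall guarantee.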
| Field characteristic | Recommended technique | Privacy model layer |
|---|---|---|
| Must round-trip for authorised consumer (payment processing) | Tokenisation (vault) or FPE | none (reversible) |
| Must join across tables, opaque value OK | Deterministic substitution / salted hashing | k-anonymity on quasi-identifiers |
| Free-text PII inside a log line | Redaction or replace-with-<TYPE> (Presidio analyzer + anonymizer) | - |
| Continuous numeric for analytics | Number variance | t-closeness if sensitive attribute |
| Categorical demographic (race, etc.) for analytics | Generalisation + l-diversity | l-diversity |
| Statistical query release | Differential privacy mechanism | DP |
| Demo / training, no analytics utility needed | Synthetic substitution (Faker / Synthea) | n/a (no real data) |
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Unsalted hashing of SSN | SSN format is enumerable (~10⁹); attacker rebuilds the mapping table in minutes. | Salt + key per tenant; or tokenise via vault. |
| FPE for an analytics dataset | Format preservation lets a join attack with another dataset recover identity. | Use random substitution for analytics datasets that don't need format round-trip. |
| "GDPR-compliant" pseudonymisation claim | GDPR pseudonymised data is still personal data - Article 4(5) is explicit. | Either mark output pseudonymised (in scope) or fully anonymise (out of scope). |
| k = 2 anonymity | An equivalence class may hold only two records, so an attacker who locates a target's class re-identifies them with probability up to 50 %. | k ≥ 5 typical; k = 10+ for high-risk datasets. |
| Shuffling a rare-value column | Outliers identify themselves regardless of position. | Combine shuffling with generalisation or suppression of outliers. |
| Number variance ± 1 % on salaries | The variance is smaller than the precision needed to identify; effectively no masking. | Variance must exceed the identifying granularity - ± 10 % minimum for salary. |
| Tokenisation without vault access controls | The vault becomes the single point of failure. | Strict access control + audit logging + separate key custody. |
| Differential privacy with ε = 100 | Useless budget; the e^ε bound is so loose that the guarantee is vacuous. | ε ≤ 1 typical for strong privacy; ε ≤ 10 for relaxed cases. |
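The first anti-pattern in the table can be demonstrated directly. The sketch below uses a toy 4-digit identifier space so that it runs instantly; the same enumeration applies to any low-entropy field. The identifier, the salt value, and the helper names are invented for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_reverse_table(digits: int) -> dict:
    """Hash every possible identifier of the given length (the attacker's table)."""
    return {
        sha256_hex(f"{i:0{digits}d}".encode()): f"{i:0{digits}d}"
        for i in range(10 ** digits)
    }


# "Masked" value as it would appear in the test dataset: an unsalted hash.
leaked_digest = sha256_hex(b"4821")

# The attacker enumerates the whole identifier space once.
reverse_table = build_reverse_table(4)
recovered = reverse_table[leaked_digest]

# With a secret salt the precomputed table no longer matches.
secret_salt = b"kept-outside-the-dataset"  # hypothetical secret
salted_digest = sha256_hex(secret_salt + b"4821")
salted_found = salted_digest in reverse_table
```

The salt only helps while it stays secret and separate from the dataset; an attacker who obtains it can rebuild the table just as easily.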
pii-masking-pipeline-builder).
pii-categories-reference, presidio-pii-detection.
npx claudepluginhub testland/qa --plugin qa-test-data-privacy