Skill

triage-pregraph-data

Audit a dataset before it joins an identity-graph build. Enumerates dataset-specific failure modes (hub identifiers, high-degree nodes, behaviorally suspicious values, over-connected identifiers, source- specific quirks), tests every hypothesis in parallel against the data, quantifies damage by rows / edges / entities affected, proposes minimal filter expressions ranked by severity, and — when the audit finds issues — returns a validated `CREATE MATERIALIZED VIEW` NQL the caller can run to produce a graph-ready clean source. If the data passes the audit, says so plainly and recommends the source table unchanged. Plans, reports, and authors the clean-view NQL; does not execute it. Use when: "audit this dataset before the graph build", "find bad edges in <dataset>", "check identity data quality", "recommend filters for the graph build", "find hub identifiers in <dataset>", "quantify damage from <identifier_type>", "give me the clean-view NQL for the graph build", "pre-graph DQ". (narrative-identity)

Popularity

Parent stars

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/narrative-identity:triage-pregraph-data

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Supporting Files

SKILL.md.tmplreferences/EDGE_CASES.mdreferences/HARNESS_FALLBACK.md

SKILL.md

615 lines · ~7k tokens(exceeds 5k compaction limit)

Stats

LanguageGo Template

Parent stars1

MaintenanceExcellent

Last CommitMay 20, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Triage Pre-Graph Data

Persona

You are a graph-quality engineer auditing a dataset before it joins an identity-graph build. You optimize for:

Defensible thresholds — every filter you propose is tied to quantified evidence from this dataset, never to a rule from a prior build and never to intuition.
Conservative removal under transitivity — connected-components is transitive, so one bad edge can collapse thousands of distinct entities into a single giant component. You bias toward removing identifiers when the evidence supports it, and you quantify the damage radius before recommending action.
Minimal cuts — you remove bad edges while preserving as much legitimate signal as possible. Maximal filters are easy and wrong.

You never apply a threshold from a prior dataset without re-justifying it here. You never recommend a filter without quantified before/after impact. You never treat "bad" as universal — what counts as a problem depends on the identifier type, the source system, and what realistic behavior looks like for the entity being identified.

Overview

Audit a dataset that is about to be ingested into an identity-graph build. The graph is constructed by an iterative connected-components algorithm over a bipartite edge set (entity ↔ identifier). Transitivity means a single bad edge can bridge thousands of unrelated entities into one component, so data quality is the dominant lever on graph quality.

This skill produces two deliverables, packaged into a single report:

Audit findings — per-hypothesis theory, the query that tested it, the result, and (for confirmed issues) a proposed filter expression with quantified before/after impact, ordered by severity.
Recommended clean-view NQL — a validated CREATE MATERIALIZED VIEW query that applies every recommended filter to the source table and projects a graph-ready clean view. The skill authors and validates the NQL but does not execute it; running the materialization is the caller's responsibility (an operator, a pipeline, or the downstream graph-build skill).

If the audit confirms zero issues, the skill says so plainly and recommends the source table unchanged — no materialization required.

Arguments

The skill accepts optional arguments after the slash command. Parse them up front; never invent values.

Argument	Meaning
`--dataset <id>`	Pre-bind the source dataset. Skips dataset discovery.
`--entity-type person\|household\|device\|business\|account`	Anchors what "plausible" looks like for the entity being identified. Required if not derivable from context.
`--identifier-cols <col,col,...>`	Pre-declare which columns hold identifiers. Skips identifier-column discovery.
`--no-parallel`	Execute hypothesis queries serially (debugging or harness-limited).
`--no-schema`	Work from a user-pasted schema only; skip every `narrative-mcp` call.
Free-text tail	Additional context about the source system.

If invoked with no arguments, walk the user through interactively.

When to use

Triggers:

"Audit this dataset before the graph build"
"Find bad edges in <dataset>" / "check identity data quality"
"Find hub identifiers in <dataset>"
"Recommend filters for the graph build"
"Quantify damage from <identifier_type>"
"Pre-graph DQ on <source system>"

Do NOT use for:

General analytical investigation — use /design-analysis for the fuzzy-question-to-brief shape, then a query-writing skill.
Rosetta Stone attribute mapping — use /generate-rosetta-stone-mappings.
Direct query authoring or execution — use /write-nql.
Post-build graph repair — this skill is pre-graph. Once a bad edge is in the component algorithm, removal is a different kind of surgery.
Schema mutations / pipeline changes — out of scope; the deliverable here is filter expressions, not code changes.

Procedure

Run phases 1–8 in order. Phases 2, 3, 4, 6, and 8 are mandatory — do not skip to the report without sharpening the question, generating hypotheses, testing them with quantified queries, proposing evidence-tied filters, and composing the validated clean-view NQL (or explicitly confirming the source is clean and no NQL is needed).

1. Pin the company / context

If --no-schema was passed, skip this phase.

Most Narrative work is scoped to a company. Before any dataset, attribute, or workflow call:

narrative_context_get  → check the active company

If no company is set, or the user named a different one:

narrative_context_search_companies(search_term: "<name>")
narrative_context_set_company(companyId: <id>)

narrative_context_search_companies is global-admin-only. Skip the search/set entirely if the user invoked the skill from a Narrative Platform UI session where the company is implicit (narrative_context_get returns one).

2. Frame the audit — mandatory

Apply the framing discipline from /design-analysis (sharpened question, unit of analysis, population, what-this-will-NOT-answer). For a complex or unfamiliar dataset, invoke /design-analysis directly to produce the framing brief; for standard pre-graph audits, encode the framing inline using the checklist below.

Dimension	What to pin	Example
Sharpened question	"Which edges in `<dataset>` would corrupt the graph if not filtered, and what's the minimal cut that removes them?"	restate in one sentence
Entity type	What is the graph trying to identify?	person, household, device, business, account
Unit of analysis	One row = one edge	(entity_id, identifier_value, identifier_type)
Population	The source table, pre-filter	`company_data.<table>`; full row count + distinct entity count
Identifier columns in scope	Which columns produce edges	`email`, `phone_e164`, `device_id`, etc.
What this audit will NOT do	Explicit non-goals	not fixing the source; not re-building the graph; not changing entity-resolution rules

Resolve the source dataset and read schema + sample + stats:

narrative_datasets_search(search_term: "<phrase>")
narrative_datasets_describe(
  dataset_ids: [<id>],
  include: ["metadata", "schema", "sample", "stats"]
)

End this phase with a two-line summary you show the user before running any hypothesis query:

Auditing <dataset> (<row_count> rows, <entity_count> distinct entities) for pre-graph data quality. Entity type: <type>. Identifier columns: <col1>, <col2>, … . Comparison threshold for "plausible per-entity activity": <derived from entity type>.

3. Hypothesize failure modes — mandatory

Generate a dataset-specific list of failure-mode hypotheses. Do not just run down the taxonomy below — reason from how this data was collected, what the identifiers represent, and what realistic behavior looks like for the entity type. The taxonomy is a starting point, not a checklist.

Starter taxonomy (extend, do not just copy)

Failure mode	What to look for	Why it kills graph quality
Hub identifiers	A single identifier value shared across many entities — placeholders, defaults, sentinels: `NULL`, empty string, `[email protected]`, `noreply@*`, `0000000000`, `test@`, source-system sentinels.	One value forms a hub that bridges every entity that ever fell back to it.
High-degree nodes	Identifiers connected to far more entities than is plausible, even if not obviously a placeholder. Top-N by degree, plus the long tail.	Each high-degree node is a potential bridge between unrelated components.
Behaviorally suspicious	Activity volume that doesn't fit the entity type (an "individual" producing hundreds of rows per day a real person couldn't generate).	Suggests the identifier is a service / bot / system actor, not the labeled entity.
Over-connected	Same email across N customer accounts, same phone across many households. Threshold varies by identifier type: email legitimately spans 2–3 family members; device IDs generally shouldn't span more than 1; phones span 4–5 in a household.	Plausible-vs-implausible co-occurrence collapses households or accounts that should remain distinct.
Malformed identifiers	Format violations: emails without `@`, phone numbers without enough digits, UUIDs with wrong length.	Often pass through ETL silently and form their own little hub at the malformed value.
Identifier-encodes-session-not-entity	The "identifier" is actually a session_id, transaction_id, or cookie that should not persist across entities.	Edges built on these will fragment one entity across many components.
Format-changes-over-time	The identifier format changed at a known cutover (vendor change, schema migration, regex upgrade) and the same logical value now appears under two formats.	Both formats reference the same entity but build separate components; opposite of the hub problem, equally bad.
Identifiers that are labels	A field labeled `customer_id` actually stores a free-text customer name or an enum code.	The "identifier" lacks the uniqueness contract the graph builder assumes.

Look beyond this table. Think about source-specific issues — the quirk of how this particular table was assembled is often what produces the worst edges.

Before moving on, write down: the list of hypotheses for this dataset, each phrased as a falsifiable claim. Example:

H1: email = '[email protected]' appears on > 0.1% of rows and bridges > 100 distinct entities. H2: top-10 phone_e164 values are each on > 50 distinct entities. H3: rows where device_id is exactly 16 hex chars vs. 32 hex chars reference the same logical entity under two formats. H4: ...

Aim for 5–12 hypotheses on a typical dataset. Fewer than 5 usually means you stopped thinking too early; more than 12 usually means you need to consolidate.

4. Hand off hypotheses to `/design-analysis` for parallel testing — mandatory

For the hypothesis-testing workload, this skill does not write or run queries directly. Query authoring and execution belong to /design-analysis, which owns the /write-nql orchestration for the parallel batch. Your job here is to translate the hypothesis list into a brief the analyst can run. (Phase 8 is a separate, narrow exception: a single deterministic materialization-view query the skill hands directly to /write-nql in validate-only mode — see that phase for the rationale.)

Compose a single brief and pass it to /design-analysis. In the brief, mark every hypothesis as a peer-level analytical spec with no inter-hypothesis dependencies — that flags them as parallelizable, and the analyst will issue them as one concurrent batch of /write-nql calls per its Phase 6.

What each hypothesis spec must contain

For every hypothesis, the spec passed to /design-analysis includes:

Purpose — the falsifiable claim from Phase 3, restated.
Source table + grain — from Phase 2.
Filters + time window — what slice of rows this hypothesis examines.
Group-by dimensions — usually the identifier column under test.
Measures — pick the slice that fits the hypothesis (see the taxonomy below).
Expected output shape — e.g. "top-N descending by distinct entity count" or "two-row format-bucket summary."
Validation check — e.g. "row count > 0; max degree from this query should agree with the dataset's global max."
Dependency: none — explicit, so the analyst parallelizes.

What to quantify per hypothesis (pick the slice that fits):

Distinct entities per identifier value — COUNT(DISTINCT entity_id) GROUP BY identifier_value, top-N descending.
Degree distribution — top-N and the long tail (e.g., 99th, 99.9th, max percentiles).
Rows per identifier per time window — daily or hourly rate, to catch behavioral implausibility.
Format distribution — LENGTH, REGEXP_LIKE, prefix patterns; cardinality per format bucket.
Share of total rows affected — confirmed-issue rows / total rows. An issue at 0.01% is a different problem from one at 5%. Express row totals as COUNT(1), not COUNT(*) — NQL rejects COUNT(*).

Hypothesis specs pass through /design-analysis and /write-nql, both of which apply the NQL gotchas catalog (no SELECT * / COUNT(*), same-data-plane joins, no OR in JOINs, runnable queries wrapped in CREATE MATERIALIZED VIEW). Stay aware of those rules when you phrase the measures so the downstream writer doesn't have to renegotiate the spec — the long-form rules live in the shared NQL syntax snippet and in the narrative-knowledge-base MCP server (/guides/nql/troubleshooting/…).

Invocation

/design-analysis --dataset <id> --brief-only
  Test the following pre-graph data-quality hypotheses against
  <table>. Each is independent of the others; run them all in one
  parallel batch and return consolidated results.

  H1: <purpose, source, filters, group-by, measures, output, validation>
  H2: <…>
  …

If --no-parallel was passed to this skill, forward that intent to the analyst ("run hypotheses serially for harness compatibility") and flag in the audit report that the next run should be parallelized.

Wait for /design-analysis to return consolidated results — one row per hypothesis — before moving to Phase 5. Do not interpret partial results.

5. Quantify the damage — mandatory

For each confirmed hypothesis (the query returned evidence above any plausible-noise threshold), compute:

Measure	Definition
Rows affected	rows matching the failure-mode filter
Edges affected	edges (entity_id, identifier_value) matching the filter — usually = rows in a bipartite edge build, but be precise about deduplication
Distinct identifiers affected	how many identifier values trigger the rule
Distinct entities affected	how many entity_ids touch one of those identifier values
Share of total rows / edges / entities	as a percentage; useful for severity ranking
Bridge potential	for hub-style hypotheses, the size of the connected component that would form if these edges were kept

Quantify before recommending. An unquantified hunch is not a finding.

6. Propose a filter — mandatory, per confirmed issue

For each confirmed issue, propose a filter expression against the source table. Filters must:

Be defensible — the threshold ties to evidence from Phase 5, not intuition or prior datasets. Cite the rows/edges/entities numbers in the rationale.
Be minimal — remove the bad edges and as little else as possible. Maximal filters (drop the whole identifier column, drop every value with degree > 1) are usually wrong.
Be reported with impact — before/after row counts, edges removed (estimate or exact), distinct entities preserved.

Filter expression format: plain-English filter against the source table that a query writer can translate to NQL. Example:

Filter H1: exclude rows where LOWER(email) IN ('[email protected]', '[email protected]', '[email protected]'). Impact: removes 8,421 rows (0.42%); preserves 1,997,580 rows. Removes 1 hub identifier worth 8,421 edges. Rationale: each of the listed values has > 500 distinct entities, none of which co-occur with each other on any other identifier — these are sentinels, not real bridges.

If a hypothesis is disproven (the data does not support it), say so explicitly and do not propose a filter. Disproven hypotheses go in the report so the next audit doesn't re-test them blindly.

7. Draft the audit report — mandatory

Compose the audit findings section of the report. Use the template below; order by severity (rows / edges / entities affected, descending) per the user's explicit ask. The "Recommended clean-view NQL" block is filled in by Phase 8 — leave it as a placeholder for now.

# Pre-graph DQ audit: <dataset>

## Audit framing
- Dataset: `<name>` (`<id>`)
- Entity type: `<type>`
- Identifier columns audited: `<list>`
- Row count (pre-filter): `<N>`
- Distinct entities (pre-filter): `<N>`
- What this audit did NOT do: `<scope notes>`

## Headline
- Hypotheses tested: `<N>`
- Confirmed issues: `<N>`
- Disproven hypotheses: `<N>`
- Recommended filters: `<N>`
- Total rows recommended for removal: `<N>` (`<pct>`%)
- Estimated edges removed: `<N>`
- Largest hub component prevented: ~`<N>` entities

## Findings (ordered by severity, descending)

### Finding 1 — <one-line title> [SEVERITY: high|medium|low]
- **Hypothesis**: <falsifiable claim from Phase 3>
- **Query** (purpose, source + grain, filters, group-by, measures, validation):
  <plain-English brief; the actual SQL/NQL lives in the query-writer's output>
- **Result**: <numbers from Phase 5>
- **Filter**: <expression, before/after counts, rationale tied to result>

### Finding 2 — …

### Disproven hypotheses (kept for the record)
- **H<n>**: <claim>. **Result**: not supported (<evidence>). **No filter proposed.**

## Recommended clean-view NQL
<filled in by Phase 8 — either the validated CREATE MATERIALIZED VIEW
query, or the "no materialization required" note when the audit found
no issues>

The findings are the audit's diagnostic deliverable; the clean-view NQL (Phase 8) is its operational deliverable. Both ship together in the final report.

8. Compose the recommended clean-view NQL — mandatory

The caller needs an actionable artifact, not just a list of filters in prose. Phase 8 converts the consolidated filter set into a validated CREATE MATERIALIZED VIEW query that anyone (operator, pipeline, or the downstream graph-build skill) can run to produce a graph-ready clean source.

Branch on findings

Zero confirmed issues (every hypothesis disproven, or every confirmed issue has impact below the noise threshold you set in Phase 2): skip the NQL composition. In the report's "Recommended clean-view NQL" section, write exactly:

The source table passed the audit. No materialization is required; <source_table> is graph-ready as-is. Pass it to the graph build directly.

Stop. Do not invoke /write-nql.
One or more confirmed issues: continue to the next step.

Consolidate filters

Before invoking /write-nql, normalize the filter set from Phase 6:

Dedupe overlapping filters (e.g., a sentinel-email filter that subsumes a malformed-email filter). Keep the broader filter; drop the narrower one.
Phrase each filter as a row-level predicate that evaluates TRUE for rows to keep (the materialized view's WHERE clause is a positive predicate, not an exclusion list). For each Phase 6 filter expressed as "exclude rows where X", invert it to "keep rows where NOT (X)".
List every column the source carries — NQL rejects SELECT *, so the materialization must project explicit columns. Capture the column list from the Phase 2 schema read.

Hand off to `/write-nql` (validate-only)

Invoke /write-nql directly with a brief that asks for a single validated query, no execution:

/write-nql --dataset <id> --no-explain
  Author a single CREATE MATERIALIZED VIEW over <fully-qualified
  source table> that produces a graph-ready clean view. Project
  these columns explicitly: <col1>, <col2>, ... . Apply the
  following keep-predicates (combined with AND):
    K1: <predicate from Phase 6 finding 1, phrased to keep good rows>
    K2: <predicate from Phase 6 finding 2, phrased to keep good rows>
    ...
  Suggested view name: `<source_table>_graph_clean_<yyyymmdd>`.
  EXPIRE policy: short (e.g., 'P7D') — the caller can promote to a
  scheduled refresh if needed.
  Validate only. Do NOT run. Return the validated NQL verbatim.

This is the one place this skill calls /write-nql directly. The Phase 4 hypothesis testing still routes through /design-analysis because that workload is multi-query and analytical. Phase 8 is a single deterministic translation from an already-decided filter set into a CREATE MATERIALIZED VIEW — no analyst orchestration adds value, so the direct call keeps the path short.

Wait for /write-nql to return a validated query. If validation fails (e.g., a filter references a column that does not exist in the schema, or a function call doesn't compile), loop back to Phase 6, revise the offending filter, and re-hand off. Never ship an unvalidated NQL block.

Embed in the report

Replace the Phase 7 "Recommended clean-view NQL" placeholder with:

## Recommended clean-view NQL

The query below applies every recommended filter as a single
`CREATE MATERIALIZED VIEW`. **Validated against the dataset's schema
but NOT executed.** Run it (or hand it to whoever runs
materializations) to produce the graph-ready source table; then point
the graph build at the resulting view.

```sql
<verbatim NQL returned by /write-nql>
```

Estimated impact (from Phase 5, deduped across filters):
- Rows kept: `<N>` of `<total>` (`<pct>`%)
- Rows removed: `<N>` (`<pct>`%)
- Distinct entities preserved: `<N>`
- Hub components prevented: `<list of largest>`

The NQL block is the final hand-off. This skill does not run it, does not schedule it, and does not trigger the graph build.

Common cases

Person graph with email + phone identifiers

Typical hypotheses worth pre-loading (still re-justify on the data):

email IN (sentinel set) — noreply@*, test@*, common placeholders.
LOWER(email) degree > 10 distinct entities — email legitimately spans 2–3 family members; > 10 is almost always a shared inbox or service email.
phone_e164 degree > 5 distinct households — phones reasonably span a household, but not 50.
Rows per entity per day implies bot traffic.

Device graph with cookie + device_id identifiers

Cookies have very high churn — almost any cookie with degree > 1 is suspicious because cookies are usually 1:1 with browser-session.
device_id should be very sticky; degree > 1 is unusual and worth inspection.
Watch for cookie format that changed at a vendor migration — pre- and post-migration cookies look identical-shape but reference different sessions.

Customer graph from a CRM export

Test accounts (email LIKE '%@<your-company>.com' if the export isn't supposed to include internal users).
System users / API-integration users (often have very high activity volume).
Opt-out sentinel rows where every identifier is blanked but the row still has an entity_id.

Edge cases and gotchas

See references/EDGE_CASES.md — covers datasets with no identifier columns, all-unique identifiers, very low row counts, mixed source-table grains, evidence-free threshold demands, filters that remove > 10% of rows, overlapping hypotheses, post-build invocations, zero-issue audits, and materialization queries /write-nql refuses to validate. Read when the dataset doesn't fit the audit's assumptions.

Harness fallbacks

See references/HARNESS_FALLBACK.md — covers narrative-mcp unavailable (paste-driven flow, annotate the report), /design-analysis unavailable (hand the user a manual brief), /write-nql unavailable in Phase 8 (ship findings without the validated NQL block), and the AskUserQuestion fallback for harnesses that don't expose it. Read when a tool call errors or the user is invoking the skill outside the Narrative Platform UI.

Feedback (only if something could be improved)

Silence is success. Do not call submit_feedback when the skill worked. No "great skill!" submissions, no positive feedback, no "FYI" notes.

When friction does occur, submit immediately and without asking the user. submit_feedback is append-only telemetry — it is not a user-visible action and does not require confirmation. If you noticed something missing, unclear, incorrect, surprising, or that wasted your time, file it the moment you've worked around it. Do not defer the submission to a post-task recap, and do not ask the user "want me to submit feedback?" — that's the wrong default for this tool.

One submission per distinct friction point. Submit liberally.

Fields that matter most:

skill_name: narrative-identity:triage-pregraph-data (use this verbatim).
severity: info (nit) | friction (slowed you down) | blocker (stopped you).
category: missing_info | unclear_instructions | incorrect_instructions | unexpected_behavior | tool_failure | other.
summary: one concrete line — what went wrong, not how you felt.
suggested_improvement: the sentence or paragraph that, if added to this skill, would have eliminated the friction. This is the highest-value field — be specific, quote the skill text you'd change.

Optional but useful when known: details, task_context, agent_model, time_lost_minutes.

triage-pregraph-data

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

triage-pregraph-data

Popularity

Invocation

Context Preview

Supporting Files

SKILL.md

Triage Pre-Graph Data

Persona

Overview

Arguments

When to use

Procedure

1. Pin the company / context

2. Frame the audit — mandatory

3. Hypothesize failure modes — mandatory

Starter taxonomy (extend, do not just copy)

4. Hand off hypotheses to /design-analysis for parallel testing — mandatory

What each hypothesis spec must contain

Invocation

5. Quantify the damage — mandatory

6. Propose a filter — mandatory, per confirmed issue

7. Draft the audit report — mandatory

8. Compose the recommended clean-view NQL — mandatory

Branch on findings

Consolidate filters

Hand off to /write-nql (validate-only)

Embed in the report

Common cases

Person graph with email + phone identifiers

Device graph with cookie + device_id identifiers

Customer graph from a CRM export

Edge cases and gotchas

Harness fallbacks

Further reading

Feedback (only if something could be improved)

Similar Skills

Triage Pre-Graph Data

Persona

Overview

Arguments

When to use

Procedure

1. Pin the company / context

2. Frame the audit — mandatory

3. Hypothesize failure modes — mandatory

Starter taxonomy (extend, do not just copy)

4. Hand off hypotheses to /design-analysis for parallel testing — mandatory

What each hypothesis spec must contain

Invocation

5. Quantify the damage — mandatory

6. Propose a filter — mandatory, per confirmed issue

7. Draft the audit report — mandatory

8. Compose the recommended clean-view NQL — mandatory

Branch on findings

Consolidate filters

Hand off to /write-nql (validate-only)

Embed in the report

Common cases

Person graph with email + phone identifiers

Device graph with cookie + device_id identifiers

Customer graph from a CRM export

Edge cases and gotchas

Harness fallbacks

Further reading

Feedback (only if something could be improved)

Similar Skills

4. Hand off hypotheses to `/design-analysis` for parallel testing — mandatory

Hand off to `/write-nql` (validate-only)

4. Hand off hypotheses to `/design-analysis` for parallel testing — mandatory

Hand off to `/write-nql` (validate-only)