Triage, standardize, and validate a CSV, Excel, or JSON list of companies or contacts before enrichment. Profile fill rate and cardinality, normalize company names and domains, validate emails and phones, tag invalid rows non-destructively. Run before match-business or match-prospects so you do not pay to enrich noreply, disposable, or empty rows.
How this skill is triggered — by the user, by Claude, or both
Slash command
/explorium-public-skills:clean-dataThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Per-row cleanup on a GTM list before any match or enrich call. Profile, standardize, validate. Never destructive: raw input is preserved, invalid rows are tagged with reason codes rather than deleted. Out of scope: deduplication and canonical entity resolution.
Per-row cleanup on a GTM list before any match or enrich call. Profile, standardize, validate. Never destructive: raw input is preserved, invalid rows are tagged with reason codes rather than deleted. Out of scope: deduplication and canonical entity resolution.
$ARGUMENTS is a path to a CSV, Excel, or JSON file. Parse the user message for optional sub-inputs:
Example phrasings:
Copy raw input. Before any transform, copy the input to ./00_raw/<filename> and never write back. All transforms write to numbered phase folders: 01_profiled/, 02_standardized/, 03_validated/. The single most common failure mode in cleanup work is destructive transforms with no path back.
Profile the file. Compute fill rate, cardinality, top values, length distribution, and format-pattern frequency for every column. Save the snapshot. Read it before deciding what to clean.
import pandas as pd
df = pd.read_csv(input_path)
profile = pd.DataFrame({
"fill_rate_pct": (df.notna().mean() * 100).round(1),
"cardinality": df.nunique(),
"top_value": df.apply(lambda c: c.dropna().astype(str).mode().iloc[0] if c.dropna().size else None),
"avg_len": df.apply(lambda c: c.dropna().astype(str).str.len().mean()),
})
Things to look for: country columns with 200+ distinct values (standardization problem, build ISO Alpha-2 lookup); phone columns where under 50% parse as E.164 (need country hint); company-name 99th-percentile length above 100 chars (pasted addresses, quarantine); free-email providers in the top 5 of email column (decide policy now); fields under 10% fill (probably not worth normalizing); literal strings "NA", "N/A", "None", "null", "-" (collapse to real nulls before validating).
Standardize string fields. Run standardization BEFORE validation: a valid email like [email protected] fails naive regex without trim+lowercase first. For every string column do Unicode NFKC, trim, collapse internal whitespace, strip leading and trailing punctuation, collapse null-token strings to real nulls. Then field-specific:
Inc, LLC, Ltd, GmbH, S.A., 株式会社) at end of string only. Use cleanco if available. Keep BOTH raw and normalized columns.www. Fold to the eTLD+1 via tldextract. Flag free-email providers and disposable domains separately.nameparser: honorifics, generational suffixes, credentials, particles. If confidence is low, store the raw string with a low-confidence flag.phonenumbers. Hint country from the country column when available.United States to US, UK to GB, Deutschland to DE). Reusable downstream for country filters.libpostal if installed. Country-aware parsing.Common mistake: overwriting the display column with the normalized version. Always keep raw alongside normalized.
Validate field-by-field. Per field, add a boolean <field>_valid and a <field>_reason text column when invalid. Tag invalid rows; never delete them.
email-validator; role-address detection (info@, sales@, noreply@, support@, hello@); disposable-domain check; free-provider flag (gmail, yahoo, qq); optional MX-record check (off by default).phonenumbers. Tag invalid_too_short, invalid_country, invalid_format.Hand off to entity resolution (optional). This skill cleans rows in isolation; it cannot tell you that Starbucks EMEA and Starbucks Corporation point to the same company. If the user wants the handoff, produce a match-ready subset and route rows by available signal:
Filter out tagged-invalid rows before the handoff so you do not spend credits matching [email protected] or disposable addresses. The returned IDs become the join keys for any later enrich a business or enrich a prospect call.
Per column: fill_rate_pct, cardinality, top_value, avg_len. Markdown table. After cleanup, re-run the profile and show before vs after on touched columns.
Per normalized field: raw column name, normalized column name, 3 to 5 example transformations (" ACME, Inc. " to acme, "WWW.Acme.COM" to acme.com, "+1 (415) 555 1212" to +14155551212).
Per validated field: counts of valid, invalid, risky. Frequency table of reason codes (e.g. role_address: 42, disposable_domain: 18, invalid_syntax: 6).
A single CSV at ./03_validated/<input_name>_clean.csv with all original columns plus <field>_norm, <field>_valid, and <field>_reason columns.
A second CSV at ./04_match_ready/<input_name>_for_match.csv containing only rows that passed validation, plus a one-line summary of which match path each row subset should route to (count by path).
meta.com vs instagram.com vs whatsapp.com). Resolve via company-hierarchies enrichment after matching.Creates, edits, and optimizes skills for Claude Code, including drafting, evaluating with test prompts, iterating on performance, and improving skill descriptions for better triggering accuracy.
npx claudepluginhub explorium-ai/public-skills --plugin explorium-public-skills