Writes clear, step-by-step instructions for cleaning messy datasets, specifying standardisation, correction, and removal steps for analysis readiness.
How this skill is triggered — by the user, by Claude, or both
Slash command
/autopunk-media-skills:data-cleaning-briefThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Writes clear, step-by-step instructions for cleaning a messy or inconsistent dataset — specifying exactly what needs to be standardised, corrected, or removed to make the data ready for analysis and publication.
Writes clear, step-by-step instructions for cleaning a messy or inconsistent dataset — specifying exactly what needs to be standardised, corrected, or removed to make the data ready for analysis and publication.
Required: A description of the dataset and the problems you can see in it — specific examples of inconsistent values, formatting problems, missing data, or structural issues. Column names and a small sample of the messy rows. Optional: The intended analysis goal (what you will do with the data once it is clean); any cleaning decisions that have already been made; the tool the analyst will use (Excel, Python, R, SQL); the deadline.
A structured brief (300–500 words) with two sections: (1) Cleaning checklist — numbered steps in recommended order, each naming the column affected, describing the problem, and specifying the action to take. (2) Decisions requiring editorial sign-off — a separate list of any steps where reasonable people could disagree on the right approach, with the options and recommended default. Plain language throughout; minimal technical jargon, but tool-specific syntax can be included in parentheses where helpful.
Dataset: Procurement contracts database, 48,000 rows, obtained via FOI. Problems I can see in the data:
contractor_name field has inconsistent capitalisation: "BRIDGEPOINT CONSTRUCTION", "Bridgepoint Construction", "bridgepoint construction ltd" all appear to be the same companycontract_value field mixes formats: some entries are "€4,200,000", others are "4200000", others are "4.2M"award_date field has multiple date formats: "15/03/2024", "2024-03-15", "March 15 2024"contract_value is blankcontract_category field uses two different classification systems mixed together: some rows use a national category code, others use an EU CPV code
Goal: Analysis of corporate concentration — which companies are winning the most by value.Cleaning Checklist
Complete steps in this order. Do not change column names unless specified.
Remove duplicate rows (contract_id column)
Problem: 87 rows share identical values in contract_id, awarding_authority, contractor_name, and award_date.
Action: Deduplicate keeping one row per unique contract_id. Where contract_id is the same but any other field differs, flag for manual review rather than automatically deleting.
Expected result: Row count reduces from ~48,000 to approximately 47,913.
Standardise award_date to ISO 8601 (award_date column)
Problem: Three date formats in use: DD/MM/YYYY, YYYY-MM-DD, "Month DD YYYY".
Action: Convert all values to YYYY-MM-DD. Assume DD/MM/YYYY for ambiguous dates (e.g. "01/03/2024" = 1 March, not 3 January). Flag any dates where day/month are both ≤ 12 and format is ambiguous.
Standardise contract_value to numeric (contract_value column)
Problem: Values in three formats: "€4,200,000", "4200000", "4.2M".
Action: Strip currency symbols and commas. Convert "M" suffix notation to full number (4.2M → 4200000). Store as integer. Do not add decimal places.
Handle blank contract_value rows
Problem: 340 rows have no contract value.
Action: Do not impute or estimate. Leave as null/blank. Document the count (340 rows) and percentage (~0.7%) in the methodology note. Do not exclude these rows from the dataset — they may still be useful for non-value analyses.
Standardise contractor_name for grouping (contractor_name column)
Problem: Same company appears in multiple capitalisation variants.
Action: Convert all values to title case as a first pass (e.g. "BRIDGEPOINT CONSTRUCTION" → "Bridgepoint Construction"). Then run a deduplication check for near-matches (Levenshtein distance ≤ 2) and produce a review list of probable duplicates for manual confirmation. Do not automatically merge names — present the list for journalist sign-off.
Flag contract_category classification system (contract_category column)
Problem: National category codes and EU CPV codes are mixed in a single column with no flag indicating which system applies.
Action: Do not clean or merge these yet. Add a new column category_system with values "national" or "cpv" based on the pattern of the code (CPV codes follow a numeric XXXXXXXX format; national codes may differ). Preserve the original column unchanged.
Note: Analysis using this column will need to treat the two systems separately unless a mapping table is available.
Decisions Requiring Editorial Sign-Off
npx claudepluginhub ur-grue/autopunk-media-skills --plugin autopunk-media-skillsPlans safe data cleaning workflows: deduplication, missing-value handling, anomaly detection, and reproducible output generation without modifying raw files.
Preprocessing workflow for journalistic data analysis emphasizing transparency, provenance, and human oversight. Use when: (1) Loading messy data files (Excel, CSV, JSON) into analysis-ready format, (2) Auditing data quality before analysis, (3) Cleaning data with full transformation documentation, (4) Preparing data for investigative journalism projects. Core principle: No silent transformations—every change is documented and approved.
Provides patterns for pre-model data preprocessing pipelines covering cleaning, validation, transformation, ETL orchestration. Use for raw data deduplication, schema validation, format conversion before EDA or modeling.