From journalism-tools
Preprocessing workflow for journalistic data analysis emphasizing transparency, provenance, and human oversight. Use when: (1) Loading messy data files (Excel, CSV, JSON) into analysis-ready format, (2) Auditing data quality before analysis, (3) Cleaning data with full transformation documentation, (4) Preparing data for investigative journalism projects. Core principle: No silent transformations—every change is documented and approved.
Slash command
/journalism-tools:structured-data-preprocessing-journalism
Preprocessing data for journalism requires higher standards than typical data science: every transformation must be traceable, every decision documented, and the human must approve substantive changes.
1. LOAD → Ingest data, establish provenance columns
2. AUDIT → Systematically examine every column for issues
3. REPORT → Present findings, proposed fixes, questions to user
4. TRANSFORM → After approval, execute documented transformations
5. VALIDATE → Confirm transformations, output final dataset + audit trail
Before loading, clarify:
Always add these columns to loaded data:
'_source_file' # Original filename
'_source_sheet' # Sheet name (if Excel) or 'csv'
'_source_row' # 1-indexed row number in original file
'_load_timestamp' # When this record was loaded
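A minimal sketch of attaching these provenance columns while loading a CSV, using only the Python standard library. The function name `load_csv_with_provenance` and the sample data are hypothetical. It also assumes the header occupies row 1 of the file and that no field contains embedded newlines, so the first data record is reported as row 2; adjust the offset if your convention counts data rows only.

```python
import csv
import io
from datetime import datetime, timezone


def load_csv_with_provenance(handle, source_file):
    """Read CSV records as dicts and attach the four provenance columns."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    records = []
    # Assumption: header is row 1, so data records start at row 2.
    for row_number, record in enumerate(csv.DictReader(handle), start=2):
        record["_source_file"] = source_file
        record["_source_sheet"] = "csv"
        record["_source_row"] = row_number
        record["_load_timestamp"] = loaded_at
        records.append(record)
    return records


# Illustrative in-memory file standing in for a real CSV on disk.
sample = io.StringIO("name,amount\nAcme,10\nBeta,20\n")
rows = load_csv_with_provenance(sample, "contracts.csv")
print(rows[0]["_source_row"], rows[0]["_source_file"])
```

Because every record carries its original row number, any value questioned later in the analysis can be traced back to a specific line of the source file.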
Systematically examine every column.
| Category | What to Check |
|---|---|
| Type | Is inferred type correct? Mixed types? |
| Missing | How many nulls? Pattern to missingness? |
| Cardinality | Unique values vs total rows |
| Distribution | Outliers? Impossible values? |
| Text quality | Encoding issues? Entity variations? Typos? |
| Dates | Consistent format? Future or distant past dates? |
| Numeric | Scale consistent? Negative where unexpected? |
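Some of the checks in the table above (type, missingness, cardinality) could be sketched as follows for a single column of raw string values. The helper names and the set of missing-value markers are assumptions for illustration; a real audit would confirm with the data owner which markers mean "missing".

```python
from collections import Counter

# Assumed markers for missing data; confirm these for each dataset.
MISSING_MARKERS = {"", "na", "n/a", "nan", "null", "none"}


def infer_type(text):
    """Classify a string as 'numeric' or 'text' (thousands commas allowed)."""
    try:
        float(text.replace(",", ""))
        return "numeric"
    except ValueError:
        return "text"


def audit_column(values):
    """Summarize missingness, cardinality, and inferred types for one column."""
    present = [
        v for v in values
        if v is not None and v.strip().lower() not in MISSING_MARKERS
    ]
    type_counts = Counter(infer_type(v) for v in present)
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "types": dict(type_counts),
        "mixed_types": len(type_counts) > 1,
    }


amounts = ["1200", "1,350", "", "N/A", "unknown", "1200"]
summary = audit_column(amounts)
print(summary)
```

The summary only flags issues. It changes nothing in the data, which keeps the audit step separate from any transformation.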
Generate a report for human review. See references/report-template.md for format.
# Data Quality Report: [Dataset Name]
## Summary
- Total rows / columns / columns with issues
## Critical Issues (Require Decision)
[Issues that could affect analysis validity]
## Warnings (Review Recommended)
[Issues that may or may not need fixing]
## Proposed Transformations
[Each transformation with rationale]
## Questions for Human Review
[Decisions that require domain knowledge]
After human approval, execute transformations with full documentation.
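One way to keep transformations documented is to route every change through a helper that records each modified cell. The sketch below is an illustration under assumptions: the function name `apply_logged` and the log field names are hypothetical, and the source does not specify the layout of the transformation log.

```python
def apply_logged(records, column, transform, rule, log):
    """Apply transform to one column, appending a log entry per changed cell."""
    for record in records:
        old_value = record[column]
        new_value = transform(old_value)
        if new_value != old_value:
            log.append({
                "source_row": record["_source_row"],  # ties back to provenance
                "column": column,
                "old_value": old_value,
                "new_value": new_value,
                "rule": rule,  # human-approved rationale for the change
            })
            record[column] = new_value
    return records


records = [
    {"_source_row": 2, "city": " Chicago "},
    {"_source_row": 3, "city": "Boston"},
]
change_log = []
apply_logged(records, "city", str.strip, "strip surrounding whitespace", change_log)
print(change_log)
```

Only cells that actually change are logged, so the log doubles as a count of how many records each approved rule touched.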
- cleaned_[name].csv - Provenance columns preserved
- transformation_log.csv - Every change documented
- data_audit_report.md - Issues, decisions, resolutions
- entity_mapping_[column].csv - If standardization applied

| Decision Type | Artifact to Generate |
|---|---|
| Entity variations | Frequency table + proposed mapping |
| Outliers | Distribution summary + flagged values |
| Missing data | Missingness by column summary |
| Duplicates | Sample duplicate groups |
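For the entity-variations row, the frequency table and proposed mapping might be produced along these lines. This is a sketch under assumptions: the normalization rule (lowercase, drop periods and commas, collapse whitespace) and the choice of the most frequent variant as the canonical form are illustrative, and the resulting mapping is a proposal for human review, not something to apply automatically.

```python
from collections import Counter, defaultdict


def propose_entity_mapping(values):
    """Return (frequency table, proposed variant -> canonical mapping)."""
    counts = Counter(values)
    groups = defaultdict(list)
    for value in counts:
        # Assumed normalization: case, periods, commas, extra whitespace.
        key = " ".join(value.lower().replace(".", "").replace(",", "").split())
        groups[key].append(value)
    mapping = {}
    for variants in groups.values():
        if len(variants) > 1:
            # Propose the most frequent spelling as the canonical form.
            canonical = max(variants, key=lambda v: counts[v])
            for variant in variants:
                if variant != canonical:
                    mapping[variant] = canonical
    return counts, mapping


vendors = ["Acme Inc.", "ACME INC", "Acme Inc.", "Beta LLC", "acme inc"]
frequencies, proposed = propose_entity_mapping(vendors)
print(frequencies.most_common())
print(proposed)
```

Presenting the counts next to the proposed mapping lets the reviewer see how many records each merge would affect before approving it.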
references/report-template.md - Full report template with examples
npx claudepluginhub nhagar/claude-plugins-journalism --plugin journalism-tools