journalistic-data-preprocessing | journalism-tools

Stats

Actions

Tags

journalistic-data-preprocessing | journalism-tools

Journalistic Data Preprocessing

Preprocessing data for journalism requires higher standards than typical data science: every transformation must be traceable, every decision documented, and the human must approve substantive changes.

Core Principles

Provenance first: Every row traces to source file, sheet, and row number
No silent transformations: Never automatically fix issues without documentation
Human-in-the-loop: Present findings and get approval before transformations
Transparency artifacts: Generate documentation a reporter could hand to an editor

Workflow Overview

1. LOAD       → Ingest data, establish provenance columns
2. AUDIT      → Systematically examine every column for issues
3. REPORT     → Present findings, proposed fixes, questions to user
4. TRANSFORM  → After approval, execute documented transformations
5. VALIDATE   → Confirm transformations, output final dataset + audit trail

Phase 1: Data Loading

Architectural Questions to Ask User

Before loading, clarify:

Structure: One table or multiple related tables?
Grain: What does each row represent?
Key fields: What uniquely identifies a record?
Time coverage: What date range does this cover?

Provenance Column Standard

Always add these columns to loaded data:

'_source_file'    # Original filename
'_source_sheet'   # Sheet name (if Excel) or 'csv'
'_source_row'     # 1-indexed row number in original file
'_load_timestamp' # When this record was loaded

Phase 2: Column Audit

Systematically examine every column.

Audit Checklist Per Column

Category	What to Check
Type	Is inferred type correct? Mixed types?
Missing	How many nulls? Pattern to missingness?
Cardinality	Unique values vs total rows
Distribution	Outliers? Impossible values?
Text quality	Encoding issues? Entity variations? Typos?
Dates	Consistent format? Future or distant past dates?
Numeric	Scale consistent? Negative where unexpected?

Phase 3: Issue Report

Generate a report for human review. See references/report-template.md for format.

Report Structure

# Data Quality Report: [Dataset Name]

## Summary
- Total rows / columns / columns with issues

## Critical Issues (Require Decision)
[Issues that could affect analysis validity]

## Warnings (Review Recommended)
[Issues that may or may not need fixing]

## Proposed Transformations
[Each transformation with rationale]

## Questions for Human Review
[Decisions that require domain knowledge]

Key Report Principles

Severity tiers: Critical > Warning > Info
Concrete examples: Show actual problematic values
Proposed fixes: Suggest specific transformations
Preserve optionality: Let human choose between approaches
Audit artifacts: Identify what documentation to produce

Phase 4: Transformations

After human approval, execute transformations with full documentation.

Phase 5: Validation & Output

Final Outputs

Clean dataset (cleaned_[name].csv) - Provenance columns preserved
Transformation log (transformation_log.csv) - Every change documented
Audit report (data_audit_report.md) - Issues, decisions, resolutions
Entity mappings (entity_mapping_[column].csv) - If standardization applied

Human Review Artifacts

Decision Type	Artifact to Generate
Entity variations	Frequency table + proposed mapping
Outliers	Distribution summary + flagged values
Missing data	Missingness by column summary
Duplicates	Sample duplicate groups

References

references/report-template.md - Full report template with examples