Skill

normalize-format

Normalizes inbox files (markdown, Claude JSON/JSONL, Readwise, transcripts, link captures) into clean markdown with partial frontmatter. Handles format-specific edge cases like content-block arrays and timestamp stripping.

automation

developer-tools

Popularity

Stars

121

Forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/thinking-frameworks-skills:normalize-format

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

- [Supported formats](#supported-formats)

SKILL.md

108 lines · ~1.4k tokens

Stats

Stars121

Forks16

MaintenanceExcellent

Last CommitJun 15, 2026

Actions

View Source View Plugin View on GitHub View README

Related skills: Called by ingest-inbox-item as step 1. Upstream of tag-by-topic, score-intuition-density, dedupe-against-corpus.

Supported formats

Extension	Format	Notes
`.md`, `.txt`	plain markdown	Default; passes through
`.json`	Claude.ai export	Conversation with messages array
`.jsonl`	Claude Code session	Content-block array per response
`.md` (Readwise-shaped)	Readwise export	Highlights + user notes
`.csv`	Readwise CSV	Per-row highlight
`.vtt`, `.srt`, `.md` (diarized)	Transcript	May include timestamps + speaker labels
`.md` with URL + commentary	Link capture	User's framing is the signal

Workflow

Normalize one file:
- [ ] Step 1: Detect format by extension + first-line sniff
- [ ] Step 2: Apply format-specific parse
- [ ] Step 3: Split long transcripts at topic boundaries (>3000 words)
- [ ] Step 4: Emit [{body, partial_frontmatter}, ...] list (usually one item)

Step 1: Detection

.jsonl with "type":"assistant" → Claude Code session.
.json with "conversation" / "messages" top-level key → Claude.ai export.
.md starting with # and Readwise boilerplate (**Highlights first synced by Readwise...**) → Readwise.
.vtt / .srt, or .md with [HH:MM:SS] timestamp pattern, or lines prefixed with speaker labels like Me: → transcript.
.md with ≤50 words and a prominent URL → link capture.
Else: plain markdown.

Step 2: Format-specific parse

Plain markdown: pass body through unchanged. Title = first H1 or filename-derived.

Claude.ai JSON: flatten content blocks to markdown. Preserve user/assistant turn labels (**Me:** / **Claude:**). Strip system-reminder blocks. provenance.author: claude, confidence: paraphrased.

Claude Code JSONL: flatten content-block array. Drop tool_use blocks unless the adjacent user message references the tool output. Strip system reminders.

Readwise: split per-book file into one seed per highlight. Body = highlight + user note. Boilerplate stripped. For bare highlights (no user note), set provenance.confidence: quoted, density capped at 3 downstream. For user-annotated highlights, confidence: owned.

Transcript: strip timestamps. Preserve speaker labels as **Speaker:** prefixes. If >3000 words, split at topic shifts — emit multiple outputs sharing parent_source. Target ~1500 words per chunk.

Link capture: separate URL from commentary. Body = user's commentary. Frontmatter adds source.linked_url. If <50 words of commentary, flag low_commentary: true so the scorer caps density.

Step 3: Split long transcripts

Split heuristic: paragraph break + topic-vocabulary shift (measured by tag overlap drop across adjacent paragraphs). Each chunk ~1500 words. Preserve parent_source across chunks.

Detection heuristics

A file that looks like a transcript but is actually an email thread (first-line sniff: From: ..., Date: ..., Subject: ...) — reclassify as plain markdown or link capture.
A Readwise file missing boilerplate — treat as plain markdown.
A .json file that isn't a Claude export — treat as plain markdown and wrap in code fences.

Worked example

Input (inbox/2026-04-21-claude-bnn.json):

{"conversation":{"name":"BNN variational","messages":[
  {"role":"user","content":[{"type":"text","text":"help me intuit why variational inference..."}]},
  {"role":"assistant","content":[{"type":"text","text":"Think of it as fitting a simple distribution..."}]}
]}}

Output:

# BNN variational

**Me:** help me intuit why variational inference...

**Claude:** Think of it as fitting a simple distribution...

With partial_frontmatter = {id: 2026-04-21-bnn-variational, title: "BNN variational", source: {type: claude-conversation, ...}, provenance: {author: claude, confidence: paraphrased}}.

Guardrails

Readwise CSV with malformed rows: skip the row, log WARN | malformed CSV row in <file> line N to changelog.
Claude.ai JSON schema drift: if messages key missing, fall back to recursive text extraction; mark confidence: paraphrased regardless.
Transcript that is actually an email thread: reclassify rather than keep as transcript.
Never OCR images. Image-only inbox items get a seed body of [image: awaiting user annotation] and status: dead with reason image-only.
Files >50k words: refuse and log. User splits manually.
Never lose the original. Even if parsing fails, the inbox file stays intact (caller moves to .processed/ only on success).

Quick reference

Seven supported formats, each with a specific parser.
Returns [{body, partial_frontmatter}, ...] — always a list, usually of length 1.
Long transcripts split into multiple outputs sharing parent_source.

normalize-format

Popularity

Invocation

Context Preview

SKILL.md

normalize-format

Popularity

Invocation

Context Preview

SKILL.md

Normalize Format

Table of Contents

Supported formats

Workflow

Step 1: Detection

Step 2: Format-specific parse

Step 3: Split long transcripts

Detection heuristics

Worked example

Guardrails

Quick reference

Similar Skills

Normalize Format

Table of Contents

Supported formats

Workflow

Step 1: Detection

Step 2: Format-specific parse

Step 3: Split long transcripts

Detection heuristics

Worked example

Guardrails

Quick reference

Similar Skills