Skill

transcription-reader

Efficiently read, parse, and analyze transcription and subtitle files (STJ, VTT, SRT, ASS, SSA). Use this skill whenever the user uploads or references a transcription file, subtitle file, or meeting recording transcript. Trigger on file extensions like .stj, .stjson, .vtt, .srt, .ass, .ssa, or when the user mentions transcripts, captions, subtitles, meeting notes from recordings, or wants to summarize a recording, find who said what, extract action items, or convert between transcription formats.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/transcription-reader:transcription-reader

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Transcription files contain timing metadata that inflates token usage without adding analytical value. This skill helps you decide the fastest path to get the content into context and work with it.

Supporting Files

agents/openai.yamlreferences/stj-format.mdscripts/extract_transcript.py

SKILL.md

138 lines · ~2.2k tokens

Stats

LanguagePython

Stars0

MaintenanceExcellent

Last CommitMar 12, 2026

Actions

View Source View Plugin View on GitHub View README

Stats

Actions

Transcription Reader

Transcription files contain timing metadata that inflates token usage without adding analytical value. This skill helps you decide the fastest path to get the content into context and work with it.

Quick Decision Tree

Identify the format from the file extension:
- .stj, .stjson, .stj.json → STJ (Standard Transcription JSON)
- .vtt → WebVTT
- .srt → SubRip
- .ass, .ssa → Advanced SubStation Alpha / SubStation Alpha
- .json (with "stj" root key) → STJ

If STJ → run the extraction script immediately. Do not read the raw JSON. STJ is deeply nested with speaker ID maps, word-level timing arrays, and metadata — raw JSON is unreadable and wastes tokens. No exceptions, regardless of file size. If stjlib is not installed, install it first (pip install stjlib).

# Basic extraction
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE
# For summarization (recommended — produces compact speaker blocks)
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --merge-speakers
# Quick overview
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --stats

For other formats, check the file size to decide your approach:

Small files (under ~200 lines / ~10KB): Just read the file directly. The overhead of running the extraction script isn't worth it — you can mentally skip the timestamps as you read. For SRT and VTT, the text is easy to parse visually. For ASS/SSA, look for Dialogue: lines.

Large files (over ~200 lines): Use the extraction script. This is where the savings matter — a 2400-line Zoom VTT becomes ~1400 lines of clean text (41% reduction, ~7K tokens saved). STJ files typically see 80-96% reduction depending on content.
```
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE [--format FORMAT] [--output OUTPUT_FILE]
```
Read the extracted text (or the raw file if small) into context.
Analyze based on the user's request (summarize, search, extract action items, etc.).

When Timing Data Matters

Sometimes the user needs timestamps — for example, "at what point did they discuss budgets?" or "create a clip list." In these cases, use the --keep-timestamps flag:

python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --keep-timestamps

This produces a format like:

[00:05:12] Alice: We need to revisit the budget for Q3.
[00:05:28] Bob: Agreed, let me pull up the numbers.

Format-Specific Notes

STJ (Standard Transcription JSON)

Always use the extraction script for STJ, regardless of file size. Raw STJ is deeply nested JSON with speaker ID maps, word-level timing arrays, confidence scores, and metadata — it's unreadable as raw text and extremely token-heavy.

Key features available in STJ:

Speaker names (not just IDs) — mapped from the speakers array
Language per segment — useful for multilingual recordings
Confidence scores — can filter out low-confidence segments with --min-confidence 0.8
Word-level timing — available when word_timing_mode is "complete" or "partial"

Diarization-only STJ files: Some STJ files contain only speaker timing without transcribed text (the text field is empty). These are produced by diarization-only tools that identify who spoke when, but don't transcribe the words. The extraction script detects this automatically and emits a warning. All post-processing flags work on these files — use --merge-speakers to collapse consecutive same-speaker segments (e.g., 954 segments → ~200 merged blocks), --time-range to answer "who was talking between minute 20-30?", and --output-format jsonl for structured output. Use --stats or --list-speakers for speaker timing summaries. If you encounter a diarization-only file, let the user know it contains speaker timing but no transcribed words — they would need to run a separate transcription tool (e.g., Whisper) on the original audio to get the actual text.

For advanced STJ work (building on top of the STJ data model, custom queries, programmatic access), read references/stj-format.md for the full specification and stjlib API reference.

WebVTT (.vtt)

WebVTT files may contain speaker labels in two forms: voice tags (<v Speaker Name>text</v>) or inline labels (Speaker Name: text). Zoom recordings use the inline form. The extraction script handles both patterns.

For small VTT files, the text is readable directly — captions are separated by blank lines with a timestamp line above each one. Just skip the timestamp lines mentally.

SRT (.srt)

SRT is the simplest format — numbered blocks with a timestamp line and text. Speaker information is not part of the format spec but is sometimes embedded in the text (e.g., SPEAKER 1: Hello or [ALICE] Hello).

For small SRT files, direct reading works fine — the structure is very regular and easy to scan.

ASS/SSA (.ass, .ssa)

ASS/SSA files have a large header (styles, metadata) followed by Dialogue: lines. The Name field in each dialogue event contains the speaker. The extraction script strips the header and style metadata, which can be 50%+ of the file.

For small ASS files, you can read directly but skip everything above the [Events] section.

Installation

The extraction script requires these Python packages:

pip install stjlib webvtt-py pysubs2

If any package is missing, the script will tell you which one to install — it gracefully handles missing optional dependencies and only requires the library for the format being processed.

Common Analysis Patterns

After getting the transcript text into context, here are effective approaches for common tasks:

Summarize a meeting: Use --merge-speakers for the best summarization input — it produces compact speaker blocks instead of one-line-per-cue output. Identify the main topics discussed, key decisions made, and action items. Structure the summary with a brief overview followed by topic-by-topic detail.

Find what was said about a topic: Search for relevant keywords. Include surrounding context (a few lines before and after) to capture the full discussion.

List action items: Look for commitment language ("I'll do", "let's", "we need to", "action item", "TODO", "by Friday", "deadline") and extract who committed to what.

Who said what: The extracted transcript has speaker labels. Filter or group by speaker name to show each person's contributions. For large files, use --speakers-only NAME to extract just one speaker's segments.

Speaker analytics: Use --stats for duration, word counts, and speaker segment counts. Use --list-speakers for a quick overview of who's in the file.

Script Reference

The bundled scripts/extract_transcript.py supports these options:

Usage: extract_transcript.py INPUT_FILE [options]

Options:
  --format FORMAT        Force format (stj, vtt, srt, ass, ssa). Auto-detected from extension if omitted.
  --output FILE          Write to file instead of stdout.
  --output-format FMT    Output as 'text' (default) or 'jsonl' (one JSON object per line).
  --keep-timestamps      Include timestamps in output.
  --merge-speakers       Merge consecutive segments from the same speaker into single blocks.
  --time-range RANGE     Extract only a time range (e.g., 10:00-20:00 or 1:05:00-1:30:00).
  --min-confidence N     Skip segments below this confidence (STJ only, default: 0.0).
  --speakers-only NAME   Filter to segments from a specific speaker.
  --language LANG        Filter to segments in a specific language (STJ only).
  --list-speakers        Just list speakers found in the file, don't extract text.
  --stats                Show transcript statistics (duration, speaker counts, word counts).

Token-saving tips: Use --merge-speakers to consolidate short segments (a 954-segment file becomes ~200 merged blocks). Combine with --time-range to extract just the section you need. For maximum token efficiency with structured data, use --merge-speakers --output-format jsonl.

transcription-reader

Invocation

Context Preview

Supporting Files

SKILL.md

transcription-reader

Invocation

Context Preview

Supporting Files

SKILL.md

Transcription Reader

Quick Decision Tree

When Timing Data Matters

Format-Specific Notes

STJ (Standard Transcription JSON)

WebVTT (.vtt)

SRT (.srt)

ASS/SSA (.ass, .ssa)

Installation

Common Analysis Patterns

Script Reference

Similar Skills

Transcription Reader

Quick Decision Tree

When Timing Data Matters

Format-Specific Notes

STJ (Standard Transcription JSON)

WebVTT (.vtt)

SRT (.srt)

ASS/SSA (.ass, .ssa)

Installation

Common Analysis Patterns

Script Reference

Similar Skills