From transcription-reader
Efficiently read, parse, and analyze transcription and subtitle files (STJ, VTT, SRT, ASS, SSA). Use this skill whenever the user uploads or references a transcription file, subtitle file, or meeting recording transcript. Trigger on file extensions like .stj, .stjson, .vtt, .srt, .ass, .ssa, or when the user mentions transcripts, captions, subtitles, meeting notes from recordings, or wants to summarize a recording, find who said what, extract action items, or convert between transcription formats.
How this skill is triggered — by the user, by Claude, or both
Slash command
/transcription-reader:transcription-readerThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
Transcription files contain timing metadata that inflates token usage without adding analytical value. This skill helps you decide the fastest path to get the content into context and work with it.
Transcription files contain timing metadata that inflates token usage without adding analytical value. This skill helps you decide the fastest path to get the content into context and work with it.
Identify the format from the file extension:
.stj, .stjson, .stj.json → STJ (Standard Transcription JSON).vtt → WebVTT.srt → SubRip.ass, .ssa → Advanced SubStation Alpha / SubStation Alpha.json (with "stj" root key) → STJIf STJ → run the extraction script immediately. Do not read the raw JSON. STJ is deeply nested with speaker ID maps, word-level timing arrays, and metadata — raw JSON is unreadable and wastes tokens. No exceptions, regardless of file size. If stjlib is not installed, install it first (pip install stjlib).
# Basic extraction
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE
# For summarization (recommended — produces compact speaker blocks)
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --merge-speakers
# Quick overview
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --stats
For other formats, check the file size to decide your approach:
Small files (under ~200 lines / ~10KB): Just read the file directly. The overhead of running the extraction script isn't worth it — you can mentally skip the timestamps as you read. For SRT and VTT, the text is easy to parse visually. For ASS/SSA, look for Dialogue: lines.
Large files (over ~200 lines): Use the extraction script. This is where the savings matter — a 2400-line Zoom VTT becomes ~1400 lines of clean text (41% reduction, ~7K tokens saved). STJ files typically see 80-96% reduction depending on content.
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE [--format FORMAT] [--output OUTPUT_FILE]
Read the extracted text (or the raw file if small) into context.
Analyze based on the user's request (summarize, search, extract action items, etc.).
Sometimes the user needs timestamps — for example, "at what point did they discuss budgets?" or "create a clip list." In these cases, use the --keep-timestamps flag:
python3 ${CLAUDE_SKILL_DIR}/scripts/extract_transcript.py INPUT_FILE --keep-timestamps
This produces a format like:
[00:05:12] Alice: We need to revisit the budget for Q3.
[00:05:28] Bob: Agreed, let me pull up the numbers.
Always use the extraction script for STJ, regardless of file size. Raw STJ is deeply nested JSON with speaker ID maps, word-level timing arrays, confidence scores, and metadata — it's unreadable as raw text and extremely token-heavy.
Key features available in STJ:
speakers array--min-confidence 0.8word_timing_mode is "complete" or "partial"Diarization-only STJ files: Some STJ files contain only speaker timing without transcribed text (the text field is empty). These are produced by diarization-only tools that identify who spoke when, but don't transcribe the words. The extraction script detects this automatically and emits a warning. All post-processing flags work on these files — use --merge-speakers to collapse consecutive same-speaker segments (e.g., 954 segments → ~200 merged blocks), --time-range to answer "who was talking between minute 20-30?", and --output-format jsonl for structured output. Use --stats or --list-speakers for speaker timing summaries. If you encounter a diarization-only file, let the user know it contains speaker timing but no transcribed words — they would need to run a separate transcription tool (e.g., Whisper) on the original audio to get the actual text.
For advanced STJ work (building on top of the STJ data model, custom queries, programmatic access), read references/stj-format.md for the full specification and stjlib API reference.
WebVTT files may contain speaker labels in two forms: voice tags (<v Speaker Name>text</v>) or inline labels (Speaker Name: text). Zoom recordings use the inline form. The extraction script handles both patterns.
For small VTT files, the text is readable directly — captions are separated by blank lines with a timestamp line above each one. Just skip the timestamp lines mentally.
SRT is the simplest format — numbered blocks with a timestamp line and text. Speaker information is not part of the format spec but is sometimes embedded in the text (e.g., SPEAKER 1: Hello or [ALICE] Hello).
For small SRT files, direct reading works fine — the structure is very regular and easy to scan.
ASS/SSA files have a large header (styles, metadata) followed by Dialogue: lines. The Name field in each dialogue event contains the speaker. The extraction script strips the header and style metadata, which can be 50%+ of the file.
For small ASS files, you can read directly but skip everything above the [Events] section.
The extraction script requires these Python packages:
pip install stjlib webvtt-py pysubs2
If any package is missing, the script will tell you which one to install — it gracefully handles missing optional dependencies and only requires the library for the format being processed.
After getting the transcript text into context, here are effective approaches for common tasks:
Summarize a meeting: Use --merge-speakers for the best summarization input — it produces compact speaker blocks instead of one-line-per-cue output. Identify the main topics discussed, key decisions made, and action items. Structure the summary with a brief overview followed by topic-by-topic detail.
Find what was said about a topic: Search for relevant keywords. Include surrounding context (a few lines before and after) to capture the full discussion.
List action items: Look for commitment language ("I'll do", "let's", "we need to", "action item", "TODO", "by Friday", "deadline") and extract who committed to what.
Who said what: The extracted transcript has speaker labels. Filter or group by speaker name to show each person's contributions. For large files, use --speakers-only NAME to extract just one speaker's segments.
Speaker analytics: Use --stats for duration, word counts, and speaker segment counts. Use --list-speakers for a quick overview of who's in the file.
The bundled scripts/extract_transcript.py supports these options:
Usage: extract_transcript.py INPUT_FILE [options]
Options:
--format FORMAT Force format (stj, vtt, srt, ass, ssa). Auto-detected from extension if omitted.
--output FILE Write to file instead of stdout.
--output-format FMT Output as 'text' (default) or 'jsonl' (one JSON object per line).
--keep-timestamps Include timestamps in output.
--merge-speakers Merge consecutive segments from the same speaker into single blocks.
--time-range RANGE Extract only a time range (e.g., 10:00-20:00 or 1:05:00-1:30:00).
--min-confidence N Skip segments below this confidence (STJ only, default: 0.0).
--speakers-only NAME Filter to segments from a specific speaker.
--language LANG Filter to segments in a specific language (STJ only).
--list-speakers Just list speakers found in the file, don't extract text.
--stats Show transcript statistics (duration, speaker counts, word counts).
Token-saving tips: Use --merge-speakers to consolidate short segments (a 954-segment file becomes ~200 merged blocks). Combine with --time-range to extract just the section you need. For maximum token efficiency with structured data, use --merge-speakers --output-format jsonl.
npx claudepluginhub yaniv-golan/transcription-reader-skill --plugin transcription-readerParses SRT subtitle files into structured JSON with timestamps, durations, text, and statistics. CLI tool for video subtitle analysis, text extraction, or plain text output.
Convert between 30+ caption/subtitle formats (SRT, VTT, ASS, JSON, TextGrid, LRC, FCPXML, Premiere, …) and shift timing. Trigger on "convert captions", "SRT to VTT", "转换字幕格式", "shift timing", "ASS styling", "karaoke effect", "导入Premiere", or any caption-format question. Do NOT trigger to fix timing accuracy (`/lai-align`) or translate (`/lai-translate`).
Converts raw meeting transcript .txt files into structured .md notes with metadata, TL;DR, key topics, action items, and quotes. Useful for processing meeting recordings or chat logs.