From lattifai-skills
Transcribe audio/video to timestamped captions with Gemini (100+ languages) or local Parakeet / SenseVoice models. Trigger on "transcribe", "speech to text", "转录", "语音转文字", "generate captions from audio", or when the user provides an audio/video file with no text. If the YouTube video already has captions, prefer `/lai-youtube`.
Slash command: `/lattifai-skills:lai-transcribe`
Generates timestamped text from audio/video. Default is Gemini (fast, broad language coverage); local models run offline on GPU.
Gemini needs an API key (free at https://aistudio.google.com/apikey):
lai config set GEMINI_API_KEY <your-key>
Pick a <base> (the media file stem or the YouTube video ID) and reuse it for the rest of the pipeline. Outputs land in the current directory:
# <base> = podcast (from podcast.mp3)
lai transcribe run podcast.mp3 podcast.transcript.json
# shortcut:
lai-transcribe podcast.mp3 podcast.transcript.json
Gemini accepts YouTube URLs directly — no download needed:
# <base> = la0CaZ2R8EY (the YouTube video ID)
lai transcribe run "https://youtu.be/la0CaZ2R8EY" la0CaZ2R8EY.transcript.json
Output naming: prefer <base>.transcript.json so it pipes cleanly into /lai-align (which writes <base>.aligned.json). Use <base>.srt etc. when the transcript itself is the final deliverable and no alignment step follows.
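The naming convention above can be sketched as a small shell function. This is a hypothetical helper, not part of the `lai` CLI: the name `transcript_name` and the URL patterns it recognizes are assumptions made for illustration, and it only prints a filename without running any transcription.

```shell
# Hypothetical helper (not part of the lai CLI): derive <base> from a local
# media path or a YouTube URL, then print the suggested transcript filename.
transcript_name() {
  local input="$1" base
  case "$input" in
    https://youtu.be/*)
      base="${input#https://youtu.be/}"   # strip the short-link prefix
      base="${base%%\?*}"                 # drop any query string such as ?t=30
      ;;
    *youtube.com/watch\?*v=*)
      base="${input#*v=}"                 # keep what follows v=
      base="${base%%&*}"                  # drop further parameters
      ;;
    *)
      base="$(basename "$input")"         # e.g. podcast.mp3
      base="${base%.*}"                   # strip the last extension
      ;;
  esac
  printf '%s.transcript.json\n' "$base"
}
```

One possible use is `lai transcribe run podcast.mp3 "$(transcript_name podcast.mp3)"`, which keeps the same <base> available for a later `/lai-align` step.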
| Model | Languages | Requires |
|---|---|---|
| `gemini-3-flash-preview` (default) | 100+ | Gemini API key |
| `gemini-3.1-pro-preview` | 100+, highest quality | Gemini API key |
| `nvidia/parakeet-tdt-0.6b-v3` | 24, offline | GPU + nemo_toolkit |
| `FunAudioLLM/SenseVoiceSmall` | zh / en / ja / ko / cantonese, offline | GPU |
Switch model:
lai transcribe run audio.mp4 output.srt transcription.model_name=gemini-3.1-pro-preview
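As a rough sketch of how the model table might drive the override, the function below prints a `transcription.model_name=...` argument. It is a hypothetical helper, not part of the `lai` CLI. The policy (SenseVoice for its listed languages when offline, Parakeet otherwise) and the `yue` code for Cantonese are assumptions; the table does not enumerate Parakeet's 24 languages, so check coverage before relying on the fallback.

```shell
# Hypothetical helper (not part of the lai CLI): print a model override for
# a connectivity mode ("online" or "offline") and an optional language code.
pick_model_override() {
  local mode="$1" lang="${2:-}"
  if [ "$mode" != "offline" ]; then
    # Online: the default Gemini model is used, so no override is printed.
    return 0
  fi
  case "$lang" in
    zh|en|ja|ko|yue)
      # Assumption: prefer SenseVoice for the languages its table row lists.
      printf 'transcription.model_name=%s\n' "FunAudioLLM/SenseVoiceSmall"
      ;;
    *)
      # Assumption: fall back to Parakeet; its language list is not given here.
      printf 'transcription.model_name=%s\n' "nvidia/parakeet-tdt-0.6b-v3"
      ;;
  esac
}
```

The printed string can be appended to a command, for example `lai transcribe run audio.mp4 output.srt $(pick_model_override offline zh)`.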
- `transcription.language=zh`: force language (otherwise auto-detect)
- `media.streaming_chunk_secs=300`: chunk long audio
- Output formats: .srt / .vtt / .ass / .json / .txt. Use .json when you plan to follow up with /lai-align.

| Problem | Fix |
|---|---|
| GEMINI_API_KEY not set | `lai config set GEMINI_API_KEY <your-key>` |
| Upload timeout / file >2 GB | Split the audio or switch to a local model |
| Wrong language detected | Force with transcription.language=en |
| Timestamps are coarse | Follow up with /lai-align |
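The upload-size row in the table above could be checked before a run with something like the following. This is a hypothetical pre-flight helper, not part of the `lai` CLI; it assumes "2 GB" means 2 GiB (2147483648 bytes), which the source does not specify, and it only prints advice rather than splitting or transcribing anything.

```shell
# Hypothetical pre-flight check (not part of the lai CLI): print "upload" when
# the file fits under the limit, or "split-or-local" when it does not.
upload_advice() {
  # Assumption: the 2 GB limit is read as 2 GiB; pass a second argument to override.
  local file="$1" limit="${2:-2147483648}" size
  # Arithmetic expansion strips the padding some wc implementations emit.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -gt "$limit" ]; then
    echo "split-or-local"
  else
    echo "upload"
  fi
}
```

When the function prints `split-or-local`, either split the audio or rerun with one of the local models via `transcription.model_name=...`.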
- `/lai-align`: sharpen timestamps after transcription
- `/lai-diarize`: add speaker labels
- `/lai-translate`: translate the transcript
- `/lai-youtube`: YouTube end-to-end (download + caption + align)
- `/lai-caption`: convert output format

npx claudepluginhub lattifai/lattifai-skills --plugin lattifai-skills