From video-recap-skills
Synthesizes Chinese TTS audio per segment from timestamped narration.json using MiMo TTS, with dynamic rate fitting and loudness handling. Part of the video-recap pipeline.
How this skill is triggered — by the user, by Claude, or both
Slash command
/video-recap-skills:video-voiceover
The summary Claude sees in its skill listing, used to decide when to auto-load this skill
Reads a timestamped narration script and synthesizes one audio clip per segment, fitting speech
to each segment's time slot (dynamic rate), then records placement metadata. The only engine is
MiMo TTS (mimo-v2.5-tts).
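As a rough illustration of what "dynamic rate" fitting could mean, the sketch below computes a speed multiplier that lets a synthesized clip fit its slot. The function name fit_rate and the 1.0 to 1.5 bounds are assumptions made for this example; the skill's actual algorithm and limits are not documented here.

```python
def fit_rate(clip_seconds: float, slot_seconds: float,
             min_rate: float = 1.0, max_rate: float = 1.5) -> float:
    """Return a speed multiplier that helps a clip fit its time slot.

    Hypothetical helper: the bounds are illustrative, not the skill's real limits.
    """
    if clip_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    # A ratio above 1 means the clip is longer than its slot and must speed up.
    needed = clip_seconds / slot_seconds
    # Clamp so speech is never slowed below min_rate or rushed past max_rate.
    return max(min_rate, min(max_rate, needed))
```

With these assumed bounds, a clip that already fits keeps a rate of 1.0, and a clip far longer than its slot is capped at 1.5 rather than sped up without limit.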
export MIMO_API_KEY=*** # MiMo TTS key (a TTS-specific MIMO_TTS_API_KEY can be used instead)
work_dir/narration.json: segments with start / end / narration, plus optional pause_after_ms and
overlaps_speech. Times are in seconds on the output timeline, where the audio will be placed.
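A minimal sketch of what such a file might contain, with a small sanity check. The field names come from the description above; the top-level "segments" key, the boolean type of overlaps_speech, the sample text, and the check_segments helper are all assumptions for illustration.

```python
import json

# Hypothetical sample data; only the field names are taken from the skill description.
narration = {
    "segments": [
        {"start": 0.0, "end": 4.5, "narration": "故事从一个雨夜开始。"},
        {"start": 5.0, "end": 9.0, "narration": "主角推开了那扇门。",
         "pause_after_ms": 300, "overlaps_speech": False},
    ]
}

def check_segments(segments):
    """Raise ValueError if a segment lacks a required field or has end <= start."""
    for i, seg in enumerate(segments):
        for key in ("start", "end", "narration"):
            if key not in seg:
                raise ValueError(f"segment {i} is missing {key!r}")
        if seg["end"] <= seg["start"]:
            raise ValueError(f"segment {i} has end <= start")
    return len(segments)

check_segments(narration["segments"])
# ensure_ascii=False keeps the Chinese narration readable in the written file.
text = json.dumps(narration, ensure_ascii=False, indent=2)
```

Writing text to work_dir/narration.json would then give the script an input in this assumed shape.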
In the orchestrated cut-mode flow, the agent writes narration.json directly against the output
timeline, and the orchestrator passes it here. In the legacy direct-cut path,
narration_mapped.json may be passed explicitly instead.
python3 scripts/voiceover.py --work-dir <work_dir> --narration <narration.json> [--mimo-voice 冰糖]
For direct one-off use, omitting --narration reads work_dir/narration.json.
Pass --narration work_dir/narration_mapped.json explicitly only for the legacy direct-cut path;
the video-recap orchestrator always passes narration.json.
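For callers that script the step, the invocation above can be assembled programmatically. This is a sketch only: build_voiceover_cmd is a hypothetical helper, and it uses just the three flags shown in the command line above.

```python
def build_voiceover_cmd(work_dir, narration=None, mimo_voice=None):
    """Build the argv list for scripts/voiceover.py (hypothetical helper)."""
    cmd = ["python3", "scripts/voiceover.py", "--work-dir", str(work_dir)]
    # Omitting --narration lets the script fall back to work_dir/narration.json.
    if narration is not None:
        cmd += ["--narration", str(narration)]
    if mimo_voice is not None:
        cmd += ["--mimo-voice", mimo_voice]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with non-ASCII voice names such as 冰糖.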
tts_segments/*.wav: one synthesized clip per narration segment.
tts_meta.json: {segments: [...], engine, narration}, where each segment carries its
audio_path, timing, pause_after_ms, and placement fields consumed by video-assemble.
TTS_WORKERS, TTS_TIMEOUT, TTS_RETRIES, ALLOW_PARTIAL_TTS tune throughput/robustness.
npx claudepluginhub worldwonderer/video-recap-skills
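A downstream step could read tts_meta.json roughly as follows. This sketch assumes the file sits directly under the work directory and relies only on the segments, engine, and audio_path keys named above; list_clips is a hypothetical helper, not part of the skill.

```python
import json
from pathlib import Path

def list_clips(work_dir):
    """Return (engine, [audio_path, ...]) read from work_dir/tts_meta.json.

    Assumes the metadata file lives at the top of the work directory.
    """
    meta_path = Path(work_dir) / "tts_meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Each segment records where its synthesized clip was written.
    paths = [seg["audio_path"] for seg in meta["segments"]]
    return meta["engine"], paths
```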