Skill

video-voiceover

Synthesizes Chinese TTS audio per segment from timestamped narration.json using MiMo TTS, with dynamic rate fitting and loudness handling. Part of the video-recap pipeline.

Python

ai-ml

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/video-recap-skills:video-voiceover

Not user invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Reads a timestamped narration script and synthesizes one audio clip per segment, fitting speech

Supporting Files

scripts/lib.py
scripts/voiceover.py

SKILL.md

57 lines · ~614 tokens

Stats

Language: Python

Stars: 283

Forks: 49

Maintenance: Excellent

Last Commit: Jun 18, 2026

What this does

Reads a timestamped narration script and synthesizes one audio clip per segment, fitting speech to each segment's time slot (dynamic rate), then records placement metadata. The only engine is MiMo TTS (mimo-v2.5-tts).
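
The dynamic rate idea can be illustrated with a small sketch. This is not the code in scripts/voiceover.py; the function name `fit_rate` and the clamp bounds (1.0 and 1.35) are assumptions chosen for illustration. It shows one common way to pick a speed multiplier so that a clip of a given natural length fits its time slot.

```python
def fit_rate(speech_s: float, slot_s: float,
             min_rate: float = 1.0, max_rate: float = 1.35) -> float:
    """Return a speed multiplier that helps `speech_s` seconds of audio fit `slot_s`.

    Hypothetical helper: the bounds are illustrative, not the skill's real limits.
    A rate above 1.0 means the speech is played faster than its natural pace.
    """
    if speech_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    rate = speech_s / slot_s  # > 1.0 when the speech overruns the slot
    # Clamp so speech is never slowed down (assumed) nor sped up unnaturally.
    return max(min_rate, min(max_rate, rate))


# A 6 s clip in a 5 s slot needs a modest speed-up; a 10 s clip hits the cap.
modest = fit_rate(6.0, 5.0)
capped = fit_rate(10.0, 5.0)
```

When the clamp is hit, the clip still overruns its slot, so a real implementation would need a fallback such as trimming or reporting the overflow.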

Requirements

export MIMO_API_KEY=***         # MiMo TTS (or a TTS-specific MIMO_TTS_API_KEY)

Input contract

work_dir/narration.json: segments with start / end / narration (plus optional pause_after_ms and overlaps_speech). Times are in seconds on the output timeline, that is, where the audio will be placed. In the orchestrated cut-mode flow, the agent writes narration.json directly against the output timeline, and the orchestrator passes it here. In the legacy direct-cut path, narration_mapped.json may be passed explicitly instead.
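
A quick pre-flight check of the segments can catch contract violations before any TTS call is made. The sketch below is a hypothetical helper, not part of the skill. It uses only the field names listed above, and it takes the segment list directly because the way segments are wrapped inside the file is not specified here.

```python
REQUIRED_FIELDS = ("start", "end", "narration")


def validate_segments(segments):
    """Return a list of human-readable problems; an empty list means the segments look usable."""
    problems = []
    for i, seg in enumerate(segments):
        missing = [key for key in REQUIRED_FIELDS if key not in seg]
        if missing:
            problems.append(f"segment {i}: missing {', '.join(missing)}")
            continue
        if seg["end"] <= seg["start"]:
            problems.append(f"segment {i}: end must be greater than start")
        if not str(seg["narration"]).strip():
            problems.append(f"segment {i}: narration is empty")
        # pause_after_ms is optional; when present it should not be negative.
        if seg.get("pause_after_ms", 0) < 0:
            problems.append(f"segment {i}: pause_after_ms is negative")
    return problems


# Illustrative data only; the values are made up.
example = [
    {"start": 0.0, "end": 2.5, "narration": "你好", "pause_after_ms": 200},
    {"start": 3.0, "end": 6.0, "narration": "欢迎观看", "overlaps_speech": False},
]
issues = validate_segments(example)
```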

Run

python3 scripts/voiceover.py --work-dir <work_dir> --narration <narration.json> [--mimo-voice 冰糖]

For direct one-off use, omitting --narration reads work_dir/narration.json. Pass --narration work_dir/narration_mapped.json explicitly only for the legacy direct-cut path; the video-recap orchestrator always passes narration.json.
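
When calling the script from another Python tool, the command line above can be assembled programmatically. This is a minimal sketch; `build_command` is a hypothetical wrapper, and it uses only the three flags shown in the Run line.

```python
def build_command(work_dir, narration=None, voice=None):
    """Assemble the argument list for scripts/voiceover.py.

    Omitting `narration` leaves the default (work_dir/narration.json) in effect.
    """
    cmd = ["python3", "scripts/voiceover.py", "--work-dir", str(work_dir)]
    if narration is not None:
        cmd += ["--narration", str(narration)]
    if voice is not None:
        cmd += ["--mimo-voice", voice]
    return cmd


cmd = build_command("work", voice="冰糖")
# To execute it: subprocess.run(cmd, check=True), with MIMO_API_KEY set in the environment.
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.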

Output contract

tts_segments/*.wav — one synthesized clip per narration segment.
tts_meta.json — {segments: [...], engine, narration} where each segment carries its audio_path, timing, pause_after_ms, and placement fields consumed by video-assemble.
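
A downstream step may want to confirm that every clip listed in tts_meta.json actually exists on disk. The sketch below is a hypothetical check, not part of video-assemble. It relies only on the `segments` list and the `audio_path` field named above, and it assumes relative paths are resolved against the work directory.

```python
from pathlib import Path


def missing_clips(meta: dict, work_dir: Path) -> list:
    """Return the audio_path values in `meta` that do not point to an existing file."""
    missing = []
    for seg in meta.get("segments", []):
        path = Path(seg["audio_path"])
        if not path.is_absolute():
            path = work_dir / path  # assumption: relative paths are work_dir-relative
        if not path.is_file():
            missing.append(seg["audio_path"])
    return missing


# Typical use, after loading the file with json.load:
#   meta = json.loads((work_dir / "tts_meta.json").read_text(encoding="utf-8"))
#   print(missing_clips(meta, work_dir))
```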

Notes

Re-runs reuse per-segment audio only when it still matches the current input; editing a segment's narration or the TTS settings regenerates the affected WAVs.
TTS_WORKERS, TTS_TIMEOUT, TTS_RETRIES, ALLOW_PARTIAL_TTS tune throughput/robustness.
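
One way to implement this kind of safe reuse is to fingerprint each segment's text together with the TTS settings and regenerate only when the fingerprint changes. The sketch below illustrates that general technique under stated assumptions. The function names and the contents of `settings` are hypothetical, and the skill's actual matching logic may differ.

```python
import hashlib
import json


def segment_fingerprint(narration: str, settings: dict) -> str:
    """Hash the narration text and TTS settings into a stable hex digest."""
    payload = json.dumps(
        {"narration": narration, "settings": settings},
        sort_keys=True,       # key order must not affect the digest
        ensure_ascii=False,   # keep Chinese text as-is before encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_regeneration(stored_fingerprint, narration: str, settings: dict) -> bool:
    """True when no clip was recorded or the text or settings have changed since."""
    return stored_fingerprint != segment_fingerprint(narration, settings)


# Illustrative settings only; the real set of relevant options is not documented here.
settings = {"voice": "冰糖", "engine": "mimo-v2.5-tts"}
fp = segment_fingerprint("你好", settings)
```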

What this skill does NOT do

Does NOT write or edit narration text.
Does NOT mux, duck, or render subtitles — that is video-assemble.
Does NOT analyze the video or choose timestamps — it voices the segments it is given.
