From voicemode
Clones and uses custom voices for VoiceMode TTS via local mlx-audio. Handles reference clip validation, transcription, and voice routing.
Slash command
/voicemode:impressions
Make VoiceMode speak in any voice. The model takes a short reference clip and synthesises fresh speech in that voice via local Qwen3-TTS on top of mlx-audio.
Status: Preview / experimental. Apple Silicon only. Opt-in.
voice= argument in voicemode:converse doesn't match a known Kokoro voice
mlx-audio service
# 1. Install the local TTS service (one-time, Apple Silicon only)
voicemode service install mlx-audio
# 2. Add a voice from a reference clip
voicemode clone add fleabag ~/Downloads/fleabag-clip.wav
# 3. Use it
voicemode converse --voice fleabag
In the MCP converse tool, pass voice="fleabag" -- VoiceMode auto-routes any voice that matches a profile in VOICEMODE_VOICES_DIR to mlx-audio instead of Kokoro / OpenAI.
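
The routing rule above can be sketched in a few lines. This is a minimal illustration, not VoiceMode's actual code: the function name `pick_tts_backend` and the `"default"` return value are hypothetical, and it assumes a profile is recognised by the presence of `<name>/default.wav` under the voices directory.

```python
import os
from pathlib import Path


def pick_tts_backend(voice: str, voices_dir: Path) -> str:
    """Return 'mlx-audio' when `voice` names a profile directory, else 'default'.

    Hypothetical helper: assumes a profile is <voices_dir>/<voice>/default.wav.
    """
    # is_file() follows symlinks, so a default.wav symlinked to another clip still counts.
    if (voices_dir / voice / "default.wav").is_file():
        return "mlx-audio"
    # Anything else falls through to the usual providers (Kokoro / OpenAI).
    return "default"


# Resolve the voices directory the same way the configuration table describes it.
voices_dir = Path(os.environ.get("VOICEMODE_VOICES_DIR", "~/.voicemode/voices")).expanduser()
print(pick_tts_backend("fleabag", voices_dir))
```

Because the lookup is by directory name, a profile whose name collides with a built-in voice would win under this sketch, which is why distinctive profile names matter.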
voicemode clone add validates the input before doing any expensive work:
default.wav.
voicemode clone add auto-transcribes into voice.md (verify it -- correct mis-hearings by hand); voice-lab's sayas reads <clip>.txt next to each wav; the MCP converse tool takes ref_text alongside a clip-path voice. (Root-caused on VL-50, 2026-06-11: 1977 Doctor Who clips stammered until transcripts were supplied -- then "much better!!!".)
If your source is longer than 9 seconds, trim with the same one-liner the runtime error suggests:
ffmpeg -i in.wav -ss 0 -t 8 out.wav
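
To check a trimmed clip before handing it over, something like the following may help. It is a standalone sketch using Python's standard `wave` module, not the validation that `voicemode clone add` itself performs; the function name `check_reference_clip` is hypothetical, and the limits are the 3-9 second, mono 24kHz 16-bit PCM format this document gives for default.wav.

```python
import wave


def check_reference_clip(path: str, min_s: float = 3.0, max_s: float = 9.0) -> list[str]:
    """Return a list of problems; an empty list means the clip matches the documented format."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        sample_bytes = wav.getsampwidth()
        duration = wav.getnframes() / rate

    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if rate != 24000:
        problems.append(f"expected 24000 Hz, got {rate} Hz")
    if sample_bytes != 2:
        problems.append(f"expected 16-bit samples, got {sample_bytes * 8}-bit")
    if not min_s <= duration <= max_s:
        problems.append(f"duration {duration:.1f}s is outside the {min_s}-{max_s}s range")
    return problems
```

Calling `check_reference_clip("out.wav")` on the trimmed file returns an empty list when all four properties match, and one message per mismatch otherwise.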
Voices live as directories under ~/.voicemode/voices/<name>/:
~/.voicemode/voices/fleabag/
├── default.wav # required: 3-9s of clean reference audio, mono 24kHz 16-bit PCM
└── voice.md # auto-generated by `voicemode clone add` -- name, source, duration, format, transcript
voice.md carries YAML front matter with name, source (original input path), duration_seconds, format (literal mono 24kHz 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11), and transcript. It documents what the clip is and where it came from.
voices.json at the voices root is retained as a legacy index -- voicemode clone add writes an entry pointing at <name>/default.wav so older consumers keep working. Prefer the directory layout above for new work.
Multiple WAVs are allowed alongside default.wav; symlink whichever one is "active" to default.wav. A directory with multiple WAVs and no default.wav is treated as a sample bin and skipped.
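
The directory rules above can be expressed as a small sketch. Both helper names are hypothetical, and the `"unspecified"` case (no default.wav and at most one WAV) is a placeholder for a situation this document does not describe; the real loader may treat it differently.

```python
from pathlib import Path


def classify_voice_dir(voice_dir: Path) -> str:
    """Classify a directory under the voices root according to the documented rules."""
    if (voice_dir / "default.wav").is_file():
        return "profile"
    # Several WAVs but no default.wav: a sample bin, which is skipped.
    if len(list(voice_dir.glob("*.wav"))) > 1:
        return "sample-bin"
    return "unspecified"  # not covered by the documented rules


def set_active_clip(voice_dir: Path, clip_name: str) -> None:
    """Point default.wav at one of the WAVs in the same directory via a relative symlink."""
    link = voice_dir / "default.wav"
    if link.is_symlink():
        link.unlink()  # replace a previous choice
    elif link.exists():
        # Refuse to clobber a real file; move it aside by hand first.
        raise FileExistsError(f"{link} is a regular file, not a symlink")
    link.symlink_to(clip_name)
```

Using a relative link keeps the profile directory relocatable, for example when the same tree is copied to another host.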
5-9 seconds of clean conversational speech beats 30 seconds of noisy podcast audio. The model copies what it hears -- including hum, music beds, and laugh tracks. See docs/finding-samples.md for ranking heuristics, an mlx-whisper word-timestamp ranker concept, and ffmpeg loudnorm recipes.
| Variable | Default | Purpose |
|---|---|---|
| VOICEMODE_VOICES_DIR | ~/.voicemode/voices | Where voice profiles live |
| VOICEMODE_REMOTE_VOICES_DIR | (unset) | Path on remote mlx-audio host (path translation) |
| VOICEMODE_MLX_AUDIO_BASE_URL | http://127.0.0.1:8890/v1 | OpenAI-compatible mlx-audio endpoint |
| VOICEMODE_IMPRESSIONS_MODEL | mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16 | Hugging Face model ID |
The unreleased 8.7.0 candidate used VOICEMODE_CLONE_* names. They're honoured in 8.7.x with a one-shot deprecation warning and removed in 8.8.0:
| Deprecated | Use instead |
|---|---|
| VOICEMODE_CLONE_BASE_URL | VOICEMODE_MLX_AUDIO_BASE_URL |
| VOICEMODE_CLONE_MODEL | VOICEMODE_IMPRESSIONS_MODEL |
| VOICEMODE_CLONE_PORT | VOICEMODE_MLX_AUDIO_PORT |
If you see those in a user's voicemode.env, suggest updating them.
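
One way to spot the deprecated names is to scan the file's text. The sketch below assumes voicemode.env holds plain `NAME=value` lines (optionally prefixed with `export`), which this document does not confirm; `suggest_renames` is a hypothetical helper, and the mapping is copied from the table above.

```python
# Deprecated name -> replacement, as listed in the table above.
RENAMES = {
    "VOICEMODE_CLONE_BASE_URL": "VOICEMODE_MLX_AUDIO_BASE_URL",
    "VOICEMODE_CLONE_MODEL": "VOICEMODE_IMPRESSIONS_MODEL",
    "VOICEMODE_CLONE_PORT": "VOICEMODE_MLX_AUDIO_PORT",
}


def suggest_renames(env_text: str) -> list[str]:
    """Return one 'OLD -> NEW' suggestion per deprecated variable set in env_text."""
    suggestions = []
    for line in env_text.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        if name in RENAMES:
            suggestions.append(f"{name} -> {RENAMES[name]}")
    return suggestions
```

Commented-out assignments are ignored on purpose, so only variables that are actually in effect produce a suggestion.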
<clip>.txt beside the wav (sayas), ref_text in converse, corrected transcript: in voice.md. See "Reference clip requirements".
af_sky (or any other Kokoro voice name) shadows the Kokoro voice. Pick distinctive names like fleabag, mike-2026, bryan_morning.
npx claudepluginhub mbailey/voicemode --plugin voicemode