Skill

impressions

Clones and uses custom voices for VoiceMode TTS via local mlx-audio. Handles reference clip validation, transcription, and voice routing.

ai-ml

Popularity

Stars

1,229

Forks

173

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/voicemode:impressions

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

Make VoiceMode speak in any voice. The model takes a short reference clip and synthesises fresh speech in that voice via local Qwen3-TTS on top of mlx-audio.

SKILL.md

110 lines · ~1.7k tokens

Stats

Language: Python

Maintenance: Excellent

Last Commit: Jun 17, 2026


Impressions

Make VoiceMode speak in any voice. The model takes a short reference clip and synthesises fresh speech in that voice via local Qwen3-TTS on top of mlx-audio.

Status: Preview / experimental. Apple Silicon only. Opt-in.

When to use this skill

User asks for "voice cloning", "do an impression", "speak as X", "add my voice"
A voice= argument in voicemode:converse doesn't match a known Kokoro voice
User wants to install or troubleshoot the mlx-audio service
User asks how to configure a remote mlx-audio server

Quick start

# 1. Install the local TTS service (one-time, Apple Silicon only)
voicemode service install mlx-audio

# 2. Add a voice from a reference clip
voicemode clone add fleabag ~/Downloads/fleabag-clip.wav

# 3. Use it
voicemode converse --voice fleabag

In the MCP converse tool, pass voice="fleabag" -- VoiceMode auto-routes any voice that matches a profile in VOICEMODE_VOICES_DIR to mlx-audio instead of Kokoro / OpenAI.
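The routing rule can be pictured as a single lookup against the voices directory. The following is a minimal sketch, not VoiceMode's actual code: `pick_backend`, `voices_root`, and the returned labels are hypothetical names, and a "profile" is assumed to be a directory holding default.wav, as described under On-disk layout.

```python
import os
from pathlib import Path


def voices_root() -> Path:
    # VOICEMODE_VOICES_DIR and its default come from the Configuration table.
    raw = os.environ.get("VOICEMODE_VOICES_DIR", "~/.voicemode/voices")
    return Path(raw).expanduser()


def pick_backend(voice: str, voices_dir: Path) -> str:
    """Return "mlx-audio" when `voice` names a cloned profile, else "default".

    Hypothetical helper: a profile is assumed to be a directory
    <voices_dir>/<voice>/ that contains default.wav.
    """
    if not voice:
        return "default"
    if (voices_dir / voice / "default.wav").is_file():
        # A matching profile wins, even over a built-in voice of the same name.
        return "mlx-audio"
    return "default"  # stands in for the usual Kokoro / OpenAI path
```

Calling `pick_backend("fleabag", voices_root())` would select mlx-audio only once the fleabag profile exists on disk. The same check is why a profile named after a Kokoro voice shadows it.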

Reference clip requirements

voicemode clone add validates the input before doing any expensive work:

Duration: 3-9 seconds (5-9s sweet spot). Clips outside this window are rejected with an actionable error.
Mono speech, no music or cross-talk. The model copies what it hears -- including hum, music beds, laugh tracks, and overlapping speakers.
Any input format accepted. WAV, MP3, M4A, etc. -- ffmpeg normalises whatever you hand it.
Output is always mono 24 kHz 16-bit PCM with loudnorm I=-16 TP=-1.5 LRA=11. This is the canonical voice-lab format; the normalised render is stored as default.wav in place of the original input.
ALWAYS pair the clip with its transcript. The model conditions on the reference text; without one it falls back to running ASR on the clip itself, and any mis-hearing (especially on noisy or vintage audio) corrupts the conditioning. The symptom is stammering, stuttered synthesis. Each entry point takes the transcript differently: voicemode clone add auto-transcribes into voice.md (verify it and correct mis-hearings by hand); voice-lab's sayas reads <clip>.txt next to each wav; the MCP converse tool takes ref_text alongside a clip-path voice. (Root-caused on VL-50, 2026-06-11: 1977 Doctor Who clips stammered until transcripts were supplied -- then "much better!!!".)
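The duration window can be checked cheaply before any synthesis work. The sketch below is an illustration only: it uses Python's standard `wave` module, so it reads PCM WAV input only (the real command accepts any format via ffmpeg), and `wav_duration_seconds` and `check_duration` are hypothetical names rather than VoiceMode functions.

```python
import wave

MIN_SECONDS = 3.0  # lower bound of the accepted window
MAX_SECONDS = 9.0  # upper bound of the accepted window


def wav_duration_seconds(source) -> float:
    """Duration of a PCM WAV, given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_duration(seconds: float) -> None:
    """Raise ValueError with an actionable hint when outside 3-9 s."""
    if seconds < MIN_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; need at least {MIN_SECONDS:.0f}s of speech"
        )
    if seconds > MAX_SECONDS:
        raise ValueError(
            f"clip is {seconds:.1f}s; trim it, e.g. "
            "ffmpeg -i in.wav -ss 0 -t 8 out.wav"
        )
```

A caller would run `check_duration(wav_duration_seconds(path))` and surface the error message to the user unchanged.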

Trimming a too-long clip

If your source is longer than 9 seconds, trim with the same one-liner the runtime error suggests:

ffmpeg -i in.wav -ss 0 -t 8 out.wav

On-disk layout

Voices live as directories under ~/.voicemode/voices/<name>/:

~/.voicemode/voices/fleabag/
├── default.wav        # required: 3-9s of clean reference audio, mono 24kHz 16-bit PCM
└── voice.md           # auto-generated by `voicemode clone add` -- name, source, duration, format, transcript

voice.md carries YAML front matter with name, source (original input path), duration_seconds, format (literal mono 24kHz 16-bit PCM, loudnorm I=-16 TP=-1.5 LRA=11), and transcript. It documents what the clip is and where it came from.
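To verify an auto-generated transcript, it helps to pull the front matter fields out of voice.md. The following is a deliberately naive sketch: `read_front_matter` is a hypothetical helper, and it assumes flat, single-line `key: value` entries between `---` delimiters, which may not hold for every file (a real YAML parser is the safer choice).

```python
def read_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` front matter delimited by `---` lines.

    Assumption: every value fits on one line. Multi-line YAML scalars
    are not handled here.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter; the body is ignored
        key, sep, value = line.partition(":")
        if sep:
            # Split on the first colon only, so transcripts may contain colons.
            fields[key.strip()] = value.strip().strip('"')
    return fields
```

Reading `fields["transcript"]` and comparing it by ear against default.wav is the manual check the clip requirements call for.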

voices.json at the voices root is retained as a legacy index -- voicemode clone add writes an entry pointing at <name>/default.wav so older consumers keep working. Prefer the directory layout above for new work.

Multiple WAVs are allowed alongside default.wav; symlink whichever one is "active" to default.wav. A directory with multiple WAVs and no default.wav is treated as a sample bin and skipped.
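The "sample bin" rule amounts to a filter on the voices root. As a rough sketch under that reading (the function name `list_voice_profiles` is hypothetical, and VoiceMode's own discovery code may differ):

```python
from pathlib import Path


def list_voice_profiles(voices_dir: Path) -> list[str]:
    """Names of directories under `voices_dir` that contain default.wav."""
    if not voices_dir.is_dir():
        return []
    names = []
    for entry in sorted(voices_dir.iterdir()):
        # is_file() follows symlinks, so a default.wav symlink counts.
        # Directories without default.wav are treated as sample bins.
        if entry.is_dir() and (entry / "default.wav").is_file():
            names.append(entry.name)
    return names
```

Plain files at the root, such as the legacy voices.json, are ignored because only directories are considered.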

Picking a clip

5-9 seconds of clean conversational speech beats 30 seconds of noisy podcast audio. The model copies what it hears -- including hum, music beds, and laugh tracks. See docs/finding-samples.md for ranking heuristics, an mlx-whisper word-timestamp ranker concept, and ffmpeg loudnorm recipes.
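For pre-normalising a candidate clip by hand, the canonical format stated earlier maps onto standard ffmpeg options. The sketch below only builds the argument list; `normalise_command` is a hypothetical helper, and the exact command used by VoiceMode or by docs/finding-samples.md may differ.

```python
def normalise_command(src: str, dst: str) -> list[str]:
    """argv for an ffmpeg render to mono 24 kHz 16-bit PCM with loudnorm."""
    return [
        "ffmpeg",
        "-i", src,
        "-ac", "1",            # downmix to mono
        "-ar", "24000",        # 24 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
        dst,
    ]
```

Passing the result to `subprocess.run(..., check=True)` runs the render, assuming ffmpeg is on PATH.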

Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| `VOICEMODE_VOICES_DIR` | `~/.voicemode/voices` | Where voice profiles live |
| `VOICEMODE_REMOTE_VOICES_DIR` | (unset) | Path on remote mlx-audio host (path translation) |
| `VOICEMODE_MLX_AUDIO_BASE_URL` | `http://127.0.0.1:8890/v1` | OpenAI-compatible mlx-audio endpoint |
| `VOICEMODE_IMPRESSIONS_MODEL` | `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16` | Hugging Face model ID |
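One way to read these settings, with the documented defaults as fallbacks, is sketched below. The helper names `setting` and `remote_clip_path` are hypothetical, and the path-translation behaviour is an assumption about what "path translation" means here: the clip's path relative to the local voices root is re-rooted under the remote directory.

```python
import os
from pathlib import Path, PurePosixPath

DEFAULTS = {
    "VOICEMODE_VOICES_DIR": "~/.voicemode/voices",
    "VOICEMODE_MLX_AUDIO_BASE_URL": "http://127.0.0.1:8890/v1",
    "VOICEMODE_IMPRESSIONS_MODEL": "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-bf16",
}


def setting(name: str, env=os.environ) -> str:
    """Value of `name` from `env`, falling back to the documented default."""
    return env.get(name) or DEFAULTS[name]


def remote_clip_path(local_clip: Path, env=os.environ) -> str:
    """Re-root a local clip path under VOICEMODE_REMOTE_VOICES_DIR if set.

    Assumed semantics, not confirmed by the source.
    """
    remote_root = env.get("VOICEMODE_REMOTE_VOICES_DIR")
    if not remote_root:
        return str(local_clip)  # local server: no translation needed
    local_root = Path(setting("VOICEMODE_VOICES_DIR", env)).expanduser()
    relative = local_clip.relative_to(local_root)  # ValueError if outside root
    return str(PurePosixPath(remote_root, *relative.parts))
```

Taking `env` as a parameter keeps the functions easy to exercise with a plain dict instead of the real process environment.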

Deprecated aliases (one release only)

The unreleased 8.7.0 candidate used VOICEMODE_CLONE_* names. They're honoured in 8.7.x with a one-shot deprecation warning and removed in 8.8.0:

| Deprecated | Use instead |
| --- | --- |
| `VOICEMODE_CLONE_BASE_URL` | `VOICEMODE_MLX_AUDIO_BASE_URL` |
| `VOICEMODE_CLONE_MODEL` | `VOICEMODE_IMPRESSIONS_MODEL` |
| `VOICEMODE_CLONE_PORT` | `VOICEMODE_MLX_AUDIO_PORT` |

If you see those in a user's voicemode.env, suggest updating them.
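The alias behaviour described here (old name still honoured, warning shown once) might look like the following sketch. `resolve` is a hypothetical helper, and the precedence rule, where the new name wins when both are set, is an assumption rather than documented behaviour.

```python
import os
import warnings

DEPRECATED_ALIASES = {
    "VOICEMODE_CLONE_BASE_URL": "VOICEMODE_MLX_AUDIO_BASE_URL",
    "VOICEMODE_CLONE_MODEL": "VOICEMODE_IMPRESSIONS_MODEL",
    "VOICEMODE_CLONE_PORT": "VOICEMODE_MLX_AUDIO_PORT",
}

_warned: set[str] = set()  # aliases that have already produced a warning


def resolve(new_name: str, env=os.environ):
    """Return the value for `new_name`, honouring a deprecated alias if set."""
    if new_name in env:
        return env[new_name]  # assumed precedence: new name wins
    for old_name, target in DEPRECATED_ALIASES.items():
        if target == new_name and old_name in env:
            if old_name not in _warned:  # one-shot warning per alias
                _warned.add(old_name)
                warnings.warn(
                    f"{old_name} is deprecated; use {new_name}",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return env[old_name]
    return None
```

The same mapping doubles as a checklist when scanning a user's voicemode.env for names to update.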

Footguns

Missing reference transcript = stammering. A clip without its transcript forces the model to run ASR on the reference itself; on anything but clean modern audio the ASR mis-hears, and the synthesis stutters. Fix: put <clip>.txt beside the wav (sayas), pass ref_text in converse, or correct the transcript: field in voice.md. See "Reference clip requirements".
Kokoro name collisions -- naming a voice af_sky (or any other Kokoro voice name) shadows the Kokoro voice. Pick distinctive names like fleabag, mike-2026, bryan_morning.
Apple Silicon only -- no fallback for Intel Macs / Linux / Windows. Don't suggest installing mlx-audio on those platforms.
First synthesis is slow -- ~3.4 GB model download on first call. Warn the user.
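The platform constraint can be checked before suggesting an install. The following is a small sketch using Python's standard `platform` module; `is_apple_silicon` is a hypothetical helper, not part of VoiceMode, and a Python interpreter running under Rosetta may report x86_64 even on Apple Silicon hardware.

```python
import platform
from typing import Optional


def is_apple_silicon(
    system: Optional[str] = None, machine: Optional[str] = None
) -> bool:
    """True when running natively on macOS with an arm64 CPU.

    Arguments default to the current host; pass values to test other cases.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"
```

When this returns False, the safer response is to explain that there is no fallback rather than to attempt the mlx-audio install.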

Deep dives

docs/setup.md -- install path, model quants table, remote mlx-audio config, troubleshooting.
docs/finding-samples.md -- clip ranking heuristic, ffmpeg loudnorm recipe, link to voice-lab.

Related

Impressions guide -- user-facing prose version of this skill.
VoiceMode skill -- primary voice interaction skill.
voice-lab -- companion repo for curating reference clips and personas.
