From babel-fish
Generate audio content — text-to-speech, podcasts, voice cloning, sound effects, speech-to-speech, dubbing, and audio isolation. Currently powered by ElevenLabs. Works with both the Python SDK and the ElevenLabs CLI. Includes ready-to-run generator scripts that Claude writes to a temp file and executes directly. Triggers: audio, elevenlabs, text-to-speech, TTS, podcast, voice, voiceover, narration, voice clone, sound effects, dubbing, speech-to-speech, audio isolation.
How this skill is triggered — by the user, by Claude, or both
Slash command
/babel-fish:audioThis skill is limited to the following tools:
The summary Claude sees in its skill listing — used to decide when to auto-load this skill
Generate audio — from single-line TTS to multi-voice podcasts. Currently powered by ElevenLabs.
Generate audio — from single-line TTS to multi-voice podcasts. Currently powered by ElevenLabs. Includes ready-to-run Python scripts. Claude writes them to a temp file and executes directly.
This skill MAY: install the SDK, generate audio files, list voices, create voice clones, write and execute generation scripts, play audio.
This skill MAY NOT: store API keys in code (use env vars or ~/.elevenlabs/api_key), commit audio files to git, generate audio without user approval of the script first.
| Shortcut | Why It Fails | The Cost |
|---|---|---|
| Hardcode API key in script | Leaks credentials to git history | Security incident |
| Skip voice selection | Default voice may not match content tone | Wasted credits on re-gen |
| Generate full podcast without preview | Long audio = expensive; mistakes compound | Non-refundable credits |
| Import pydub for concatenation | Broken on Python 3.13+ (audioop removed) | Runtime crash |
| Use VoiceSettings with cloned voices | Custom settings destabilize cloned voices | Garbled/robotic audio |
Use ... for pauses | Causes hesitation/nervousness artifacts | Unnatural stuttering |
| Use large chunks for long content | Quality degrades in second half | Robotic pacing |
Skip language_code with accented speakers | Model guesses language from accent | Chinese/French mid-narration |
Entry: User wants audio content.
Check in order — use the first one found:
# 1. Check CLI auth (preferred — already logged in)
elevenlabs auth whoami --no-ui 2>/dev/null
# 2. Check env var
echo "${ELEVENLABS_API_KEY:+set}"
# 3. Check stored key file
cat ~/.elevenlabs/api_key 2>/dev/null | head -c 10
API key resolution in Python (use this in ALL scripts):
import os
def get_api_key() -> str:
"""Resolve ElevenLabs API key from CLI store, env var, or fail."""
key_file = os.path.expanduser("~/.elevenlabs/api_key")
if os.path.exists(key_file):
return open(key_file).read().strip()
key = os.environ.get("ELEVENLABS_API_KEY", "")
if key:
return key
raise RuntimeError(
"No ElevenLabs API key found. Run `elevenlabs auth login` or "
"set ELEVENLABS_API_KEY environment variable."
)
python3 -c "import elevenlabs" 2>/dev/null || uv pip install --system --break-system-packages elevenlabs
IMPORTANT: Do NOT install pydub. It's broken on Python 3.13+ (audioop removed). The scripts below use raw MP3 byte concatenation — MP3 is a frame-based format and files can be concatenated directly.
IMPORTANT: On Python 3.14+, client.text_to_speech.convert() returns a generator, not
bytes. All scripts below use a to_bytes() helper to normalize this.
python3 -c "
from elevenlabs.client import ElevenLabs
import os
key_file = os.path.expanduser('~/.elevenlabs/api_key')
api_key = open(key_file).read().strip() if os.path.exists(key_file) else os.environ.get('ELEVENLABS_API_KEY', '')
client = ElevenLabs(api_key=api_key)
voices = client.voices.get_all()
print(f'Connected. {len(voices.voices)} voices available.')
for v in voices.voices[:10]:
labels = dict(v.labels) if v.labels else {}
print(f' {v.voice_id} | {v.name:25s} | {labels.get(\"accent\", \"\")} {labels.get(\"gender\", \"\")}')
"
CRITICAL: Voice IDs are account-specific. Never hardcode voice IDs from examples or documentation — always run Step 3 first to discover the actual IDs available.
Exit: Auth verified, SDK installed, voices listed.
Entry: User wants a single audio file from text (< 5,000 chars).
#!/usr/bin/env python3
"""ElevenLabs TTS generator."""
import os
from elevenlabs.client import ElevenLabs
TEXT = """Your text here."""
VOICE_ID = "FILL_FROM_VOICE_LIST"
MODEL_ID = "eleven_multilingual_v2"
OUTPUT_PATH = "output.mp3"
def to_bytes(audio) -> bytes:
return audio if isinstance(audio, bytes) else b"".join(audio)
key_file = os.path.expanduser("~/.elevenlabs/api_key")
api_key = open(key_file).read().strip() if os.path.exists(key_file) else os.environ["ELEVENLABS_API_KEY"]
client = ElevenLabs(api_key=api_key)
print(f"Generating {len(TEXT)} chars with {MODEL_ID}...")
audio = to_bytes(client.text_to_speech.convert(
text=TEXT,
voice_id=VOICE_ID,
model_id=MODEL_ID,
output_format="mp3_44100_128",
language_code="en", # ALWAYS set for cloned/accented voices
))
with open(OUTPUT_PATH, "wb") as f:
f.write(audio)
print(f"Saved to {OUTPUT_PATH} ({os.path.getsize(OUTPUT_PATH) / 1024:.0f} KB)")
Exit: Audio file saved.
Entry: User wants narration of long-form content (> 5,000 chars).
THIS IS THE CRITICAL PHASE. Long-form audio requires special handling to maintain quality throughout. The approach below was battle-tested and is the only one that produces consistent quality across 10+ minute narrations.
Create a separate speech-text.md adapted for listening:
| Written form | Speech form | Why |
|---|---|---|
90% | ninety percent | TTS mispronounces digits |
1.7 times | one point seven times | Same |
2 AM | two in the morning | Natural speech |
Kačka | Kachka | Phonetic for TTS |
Žaneta | Zhaneta | Phonetic for TTS |
Aibility | Eigh-bility | Phonetic — write directly in text |
**bold text** | bold text | Strip all markdown |
--- | (remove) | Strip section breaks |
Pause control:
<break time="0.7s" /> — sub-section pause (v2 supports SSML break tags)<break time="1.0s" /> — major section transition<break time="1.2s" /> — thesis/key moment (max recommended)... — causes hesitation/nervousness artifactsWhat NOT to do:
<lexeme> tags — they get read aloud as textWhy this approach: Large chunks (4000+ chars) degrade in quality — the model loses emotional range and natural pacing in the second half. Small chunks (800-1200 chars) stay high quality. Request stitching chains them together for continuity.
CRITICAL for cloned voices:
language_code="en" is mandatory — without it, the model guesses language from
accent and can switch to Chinese/French mid-narration#!/usr/bin/env python3
"""ElevenLabs long-form narration with request stitching.
Splits text into small chunks, chains via previous_request_ids for
continuity, uses httpx directly to access request-id headers.
"""
import os
import httpx
# --- CONFIG ---
SPEECH_TEXT_PATH = "speech-text.md"
VOICE_ID = "FILL_FROM_VOICE_LIST"
OUTPUT_PATH = "speech.mp3"
CHUNK_SIZE = 1000 # chars per chunk — keep 800-1200 for quality
LANGUAGE_CODE = "en" # ALWAYS set for cloned/accented voices
# --- END CONFIG ---
api_key_file = os.path.expanduser("~/.elevenlabs/api_key")
api_key = open(api_key_file).read().strip() if os.path.exists(api_key_file) else os.environ["ELEVENLABS_API_KEY"]
with open(SPEECH_TEXT_PATH, "r") as f:
text = f.read()
# Split into small chunks at paragraph boundaries
paragraphs = text.split("\n\n")
chunks, current = [], ""
for p in paragraphs:
if len(current) + len(p) + 2 > CHUNK_SIZE and current.strip():
chunks.append(current.strip())
current = p
else:
current = f"{current}\n\n{p}" if current else p
if current.strip():
chunks.append(current.strip())
print(f"Script: {len(text)} chars -> {len(chunks)} chunks")
for i, c in enumerate(chunks):
print(f" Chunk {i+1}: {len(c)} chars")
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
all_audio = b""
prev_request_id = None
for i, chunk in enumerate(chunks):
data = {
"text": chunk,
"model_id": "eleven_multilingual_v2",
"output_format": "mp3_44100_128",
"language_code": LANGUAGE_CODE,
}
# Chain to previous chunk for prosody continuity
if prev_request_id:
data["previous_request_ids"] = [prev_request_id]
# Give forward context from next chunk
if i + 1 < len(chunks):
data["next_text"] = chunks[i + 1][:500]
print(f" [{i+1}/{len(chunks)}] {len(chunk)} chars...", end=" ", flush=True)
resp = httpx.post(url, json=data, headers=headers, timeout=60)
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text[:200]}")
break
prev_request_id = resp.headers.get("request-id")
all_audio += resp.content
print(f"done ({len(resp.content)//1024} KB)")
with open(OUTPUT_PATH, "wb") as f:
f.write(all_audio)
size_mb = os.path.getsize(OUTPUT_PATH) / (1024 * 1024)
print(f"\nSaved to {OUTPUT_PATH} ({size_mb:.1f} MB)")
Always test first: Generate chunks 1-2 as a preview clip before committing to the full generation. Credits are non-refundable.
Listen to the full audio. If specific sections sound off:
previous_request_ids (from the preceding chunk)
and next_request_ids (from the following chunk) to maintain flowExit: Long-form narration audio saved.
Entry: User wants a custom voice from their audio.
| Requirement | Details |
|---|---|
| Duration | 1-2 minutes (more than 3 min can be detrimental) |
| Content | Read your own writing — natural intonation matches best |
| Quality | Quiet room, no background noise, consistent distance from mic |
| Format | MP3 128kbps or higher, mono or stereo |
| Style | Consistent pace and tone — the clone replicates EVERYTHING |
| Avoid | Stumbles, "uhm"s, long pauses, whispers, shouting, music |
CRITICAL: Do NOT pre-process the recording with ffmpeg filters (silenceremove, loudnorm, etc.). These strip voice characteristics the clone needs. The only acceptable preprocessing is trimming to length.
from elevenlabs import ElevenLabs
client = ElevenLabs(api_key=get_api_key())
voice = client.voices.ivc.create(
name="User Voice",
description="Natural speaking voice for narration",
files=[open("recording.mp3", "rb")],
remove_background_noise=False, # Preserve voice characteristics
)
print(f"Voice ID: {voice.voice_id}")
After cloning, ALWAYS test with a short clip before generating long content:
audio_gen = client.text_to_speech.convert(
text="A short test sentence to verify the voice sounds right.",
voice_id=voice.voice_id,
model_id="eleven_multilingual_v2",
output_format="mp3_44100_128",
language_code="en",
# DO NOT pass voice_settings — defaults are best for clones
)
| Model | Works with clones? | Notes |
|---|---|---|
eleven_multilingual_v2 | YES — use this | Best voice fidelity with clones |
eleven_v3 | NO | Smooth output but voice identity completely lost |
eleven_flash_v2_5 | Untested | May work, lower quality expected |
eleven_turbo_v2_5 | Untested | May work |
Do NOT override VoiceSettings for cloned voices. Default settings produce the best results. Every combination tested (stability 0.3-0.8, similarity 0.5-1.0, style 0.3-0.7, speaker boost on/off) made the output worse — garbled, robotic, or unnatural pacing.
If you must tweak, test with a single sentence first and compare to the no-settings version before committing to a full generation.
Exit: Custom voice created and tested.
Entry: User wants podcast-style audio (single voice, long content).
Use the Phase 2 Long-Form Narration approach with request stitching.
The old approach (4500-char chunks with previous_text) produces lower
quality than small chunks with previous_request_ids.
#!/usr/bin/env python3
"""ElevenLabs multi-voice podcast generator."""
import os
from elevenlabs.client import ElevenLabs
SEGMENTS = [
("VOICE_ID_HOST", "Welcome to the show..."),
("VOICE_ID_GUEST", "Thanks for having me..."),
]
MODEL_ID = "eleven_multilingual_v2"
OUTPUT_PATH = "dialogue-podcast.mp3"
def to_bytes(audio) -> bytes:
return audio if isinstance(audio, bytes) else b"".join(audio)
client = ElevenLabs(api_key=get_api_key())
audio_parts = []
for i, (voice_id, text) in enumerate(SEGMENTS):
print(f" [{i+1}/{len(SEGMENTS)}] {text[:50]}...")
audio_bytes = to_bytes(client.text_to_speech.convert(
text=text,
voice_id=voice_id,
model_id=MODEL_ID,
output_format="mp3_44100_128",
language_code="en",
))
audio_parts.append(audio_bytes)
with open(OUTPUT_PATH, "wb") as f:
for part in audio_parts:
f.write(part)
print(f"Saved to {OUTPUT_PATH}")
audio = client.text_to_sound_effects.convert(
text="Heavy rain on a tin roof with distant thunder",
duration_seconds=10.0,
)
with open("rain.mp3", "wb") as f:
f.write(to_bytes(audio))
Tips: be specific ("footsteps on gravel" > "walking sounds"), include environment, specify duration.
with open("input.mp3", "rb") as f:
input_audio = f.read()
transformed = to_bytes(client.speech_to_speech.convert(
audio=input_audio,
voice_id="target_voice_id",
model_id="eleven_english_sts_v2",
))
with open("transformed.mp3", "wb") as f:
f.write(transformed)
with open("noisy.mp3", "rb") as f:
clean = to_bytes(client.audio_isolation.audio_isolation(audio=f.read()))
with open("clean.mp3", "wb") as f:
f.write(clean)
elevenlabs auth login # Interactive API key setup
elevenlabs auth whoami --no-ui # Check status
elevenlabs auth logout # Remove stored key
The CLI is focused on agent management, NOT TTS. For TTS, use the Python SDK.
| Model ID | Best For | Char Limit | Latency | Clone Support |
|---|---|---|---|---|
eleven_multilingual_v2 | Long-form, cloned voices | 10,000 | Standard | YES |
eleven_v3 | Dramatic, expressive (stock voices) | 5,000 | ~300ms | NO — loses identity |
eleven_flash_v2_5 | Ultra-low latency | 40,000 | ~75ms | Untested |
eleven_turbo_v2_5 | Quality + speed | 40,000 | ~250ms | Untested |
Using a cloned voice?
├─ Yes → eleven_multilingual_v2 (only reliable option)
└─ No → Content > 5,000 chars?
├─ Yes → eleven_multilingual_v2
└─ No → Need dramatic delivery?
├─ Yes → eleven_v3
└─ No → Need low latency?
├─ Yes → eleven_flash_v2_5
└─ No → eleven_turbo_v2_5
| Method | Works? | Notes |
|---|---|---|
<break time="0.7s" /> | YES (v2 only) | SSML break tag, up to 3s. Use sparingly (max 5-6 per generation) |
| Paragraph breaks | YES | Natural, reliable, no cost |
| Short sentences | YES | Best method — rhythm from writing |
... ellipsis | NO | Causes hesitation/nervousness artifacts |
Multiple dashes -- -- | Somewhat | Inconsistent |
| Method | Works? | Notes |
|---|---|---|
| Phonetic spelling in text | YES | Most reliable: "Eigh-bility" instead of "Aibility" |
| Pronunciation dictionary API | UNRELIABLE | Silently ignored with some model/voice combos |
<lexeme> tags in text | NO | Read aloud as text |
<phoneme> SSML tags | v2: NO, Flash v2: YES | Only works with specific models |
Rule: Always use phonetic spelling directly in the speech text. Don't rely on dictionaries or SSML phoneme tags.
| Format | Quality | Use Case |
|---|---|---|
mp3_44100_128 | High | Default, general purpose |
mp3_44100_192 | Highest MP3 | Archival |
pcm_44100 | Lossless | Post-processing |
eleven_flash_v2_5 is 50% cheaper than other models# When using httpx directly (for request stitching):
resp = httpx.post(url, json=data, headers=headers, timeout=60)
if resp.status_code == 401:
print("Bad API key.")
elif resp.status_code == 400 and "quota_exceeded" in resp.text:
print("Out of credits.")
elif resp.status_code != 200:
print(f"Error {resp.status_code}: {resp.text[:200]}")
~/.elevenlabs/api_key or env var, never hardcodedlanguage_code set for cloned or accented voices... ellipses in speech textnpx claudepluginhub ondrej-svec/heart-of-gold-toolkit --plugin babel-fishGuides voice cloning (ElevenLabs, HeyGen, Vbee) and AI audio production for podcasts, audiobooks, and voiceovers. Includes repurposing one podcast into ten short clips.
Creates single-voice audio content like audiobooks, voiceovers, narrations, jingles, and ads via TTS orchestration, background music, and FFmpeg assembly.
Clones voices via ElevenLabs Instant Voice Cloning pipeline: sourcing reference audio, preparing samples, uploading for IVC, testing with TTS, and tuning settings.