From claude-voice-skills
Use when building live voice translation features with OpenAI's GPT-Realtime-Translate model — interpreters, multilingual calls, conferences, live video translation. Covers the dedicated `/v1/realtime/translations` endpoint, multi-speaker session architecture, language support (70+ in / 13 out). Do NOT use for conversational agents (see voice-agent-realtime).
Slash command: `/claude-voice-skills:voice-translate-live`
This skill is the operational playbook for building real-time voice translation on OpenAI's gpt-realtime-translate model (announced 2026-05-07). Every event name, JSON field, endpoint URL, and pricing number traces back to /Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md, itself sourced from the official OpenAI docs and a live API probe. Items the research file flagged [UNCONFIRMED] carry the same tag here — do not silently treat them as proven.
READ THIS BEFORE WRITING ANY EVENT NAME. The translation endpoint uses a different event vocabulary from the conversational endpoint. Audio buffer events are prefixed `session.` (e.g. `session.input_audio_buffer.append`). The conversational endpoint uses unprefixed names. Mixing them silently breaks: the wire accepts the JSON but the model never sees your audio. See `## Event vocabulary — CRITICAL DIFFERENCE`.
- voice-agent-realtime (powered by gpt-realtime-2). Translate has no tool calling and no reasoning levels; it is a direct audio-in / audio-out translator.
- voice-transcribe-stream (powered by gpt-realtime-whisper, $0.017/min). You will already get a source-language transcript for free as a side-effect of translation, but if transcription is your sole goal, the whisper model is half the price.
- whisper-1 for transcription plus a regular Chat Completions call for translation. The Realtime API is overkill for non-interactive workloads.

| Property | Value |
|---|---|
| Model ID | gpt-realtime-translate |
| Modalities in | Audio |
| Modalities out | Audio + text (target-language audio + concurrent source/target transcripts) |
| Tool calling | Not supported |
| Reasoning levels | Not applicable |
| Pricing | $0.034 / minute of realtime audio duration |
| Max context window | 16,000 tokens |
| Max output per response | 2,000 tokens |
| Input languages | 70+ |
| Output languages | 13 |
| Audio format | PCM16, 24 kHz, base64-encoded on the wire |
| Voice config | [UNCONFIRMED] — translation guide did not show a voice field; model card lists audio output but only audio.output.language is shown in examples |
| Knowledge cutoff | Sep 30, 2024 |
| Benchmark cited | 12.5% lower WER than competing models on Hindi/Tamil/Telugu (BolnaAI eval, OpenAI announcement) |
Full pricing worked examples → references/pricing-limits.md.
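The audio format row in the table above (PCM16, 24 kHz, base64 on the wire) implies some simple sizing arithmetic per chunk. A minimal sketch, assuming mono audio and a 20 ms chunk duration; both are illustrative assumptions, not documented requirements, and `chunk_bytes` / `encode_chunk` are hypothetical helper names:

```python
import base64

SAMPLE_RATE_HZ = 24_000  # from the table: 24 kHz
BYTES_PER_SAMPLE = 2     # PCM16 means 16-bit (2-byte) samples
CHANNELS = 1             # assumption: mono audio


def chunk_bytes(duration_ms: int) -> int:
    """Raw PCM16 byte count for a chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000


def encode_chunk(pcm: bytes) -> str:
    """Base64-encode raw PCM16 bytes for transport as a JSON string."""
    return base64.b64encode(pcm).decode("ascii")


silence = bytes(chunk_bytes(20))   # 20 ms of digital silence
encoded = encode_chunk(silence)
print(len(silence), len(encoded))  # 960 1280
```

Base64 inflates the payload by a third, so one second of audio is 48,000 raw bytes and 64,000 characters on the wire.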
WebSocket: wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate
WebRTC SDP exchange: POST https://api.openai.com/v1/realtime/translations/calls
Ephemeral token mint: POST https://api.openai.com/v1/realtime/translations/client_secrets
Header: Authorization: Bearer ${OPENAI_API_KEY}
Header: OpenAI-Safety-Identifier: <hashed-user-id> (optional, recommended)
The base path /v1/realtime/translations is distinct from the conversational base path /v1/realtime. Anything that worked on the conversational endpoint is not guaranteed to work here.
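As a rough illustration of the connection parameters listed above, the WebSocket URL and headers could be assembled as follows. This is a sketch only: `translation_ws_url` and `auth_headers` are hypothetical helper names, and the placeholder key is for demonstration.

```python
import os
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_WS = "wss://api.openai.com/v1/realtime/translations"
MODEL = "gpt-realtime-translate"


def translation_ws_url(model: str = MODEL) -> str:
    """WebSocket URL for the translation endpoint (not the conversational one)."""
    return f"{BASE_WS}?{urlencode({'model': model})}"


def auth_headers(api_key: str, safety_id: Optional[str] = None) -> Dict[str, str]:
    """Required bearer header plus the optional hashed-user safety identifier."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if safety_id:
        headers["OpenAI-Safety-Identifier"] = safety_id
    return headers


api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")
print(translation_ws_url())
print(sorted(auth_headers(api_key, safety_id="hashed-user-id")))
```

Keeping the base path in a single constant makes it harder to accidentally point translation code at `/v1/realtime`.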
The translation endpoint uses a simplified, prefixed event vocabulary. The research file marks this as a [CONTRADICTION] against the conversational endpoint, and it is the single most common source of silent failure when porting code between the two.
| Action | Conversational endpoint | Translation endpoint |
|---|---|---|
| Configure session | session.update | session.update (same name; different fields) |
| Append audio chunk | input_audio_buffer.append | session.input_audio_buffer.append |
| Streamed translated audio (server→client) | response.output_audio.delta | session.output_audio.delta |
| Streamed target-language transcript | response.output_audio_transcript.delta | session.output_transcript.delta |
| Streamed source-language transcript | (n/a — model only sees user audio) | session.input_transcript.delta |
| Manual commit, response.create, cancel, truncate | Available | [UNCONFIRMED] — translation guide did not enumerate them; translation is a continuous-stream model so response.create likely does not apply |
| .done / .completed server events | Enumerated for each .delta | [UNCONFIRMED]: only .delta shown in docs; likely exist, probe before relying |
| Tool calling, MCP, function_call_output | Supported | Not supported |
WARNING. Do NOT mix vocabularies. The translation endpoint silently drops unprefixed events: the connection stays open, you receive no error, and no translated audio ever comes back. Before sending audio, grep your own code for the literal `input_audio_buffer.append` and confirm every occurrence is preceded by `session.`.
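The grep check described in the warning can be automated. A minimal sketch, assuming your bridge code is available as a string; `find_unprefixed` is a hypothetical helper, and the regex only covers the one literal named in the warning:

```python
import re

# Matches "input_audio_buffer.append" only when it is NOT preceded by "session."
UNPREFIXED = re.compile(r"(?<!session\.)\binput_audio_buffer\.append\b")


def find_unprefixed(source: str) -> list:
    """Return 1-based line numbers that contain an unprefixed append event."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if UNPREFIXED.search(line)
    ]


sample = "\n".join([
    'ws.send({"type": "input_audio_buffer.append"})',          # wrong on this endpoint
    'ws.send({"type": "session.input_audio_buffer.append"})',  # correct
])
print(find_unprefixed(sample))  # [1]
```

Running a check like this in CI catches code ported from the conversational endpoint before it fails silently in production.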
[UNCONFIRMED] lifecycle events. session.created, session.updated, and error almost certainly exist (the WebSocket needs to acknowledge config and report errors), but the translation guide did not explicitly enumerate them. Treat their presence as probable but not contractual.
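Because some server events are unconfirmed, a receive loop is safer if it handles the three documented `.delta` events explicitly and records everything else instead of failing. A sketch under stated assumptions: the payload field name `delta` is borrowed from the conversational endpoint and is not confirmed for translation, and `TranslationState` / `handle_server_event` are hypothetical names.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class TranslationState:
    audio: bytearray = field(default_factory=bytearray)  # translated PCM16 audio
    source_text: str = ""                                # source-language transcript
    target_text: str = ""                                # target-language transcript
    unknown_types: list = field(default_factory=list)    # events to investigate


def handle_server_event(raw: str, state: TranslationState) -> None:
    """Route one server message; never raise on an unrecognized event type."""
    event = json.loads(raw)
    etype = event.get("type", "")
    payload = event.get("delta", "")  # assumption: field is named "delta"
    if etype == "session.output_audio.delta":
        state.audio.extend(base64.b64decode(payload))
    elif etype == "session.output_transcript.delta":
        state.target_text += payload
    elif etype == "session.input_transcript.delta":
        state.source_text += payload
    else:
        # Includes session.created / session.updated / error if they exist.
        state.unknown_types.append(etype)


state = TranslationState()
handle_server_event(json.dumps({"type": "session.output_transcript.delta", "delta": "Hola"}), state)
handle_server_event(json.dumps({"type": "session.created"}), state)
print(state.target_text, state.unknown_types)  # Hola ['session.created']
```

Logging the contents of `unknown_types` during a live probe is one way to resolve the [UNCONFIRMED] items.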
The rule: one session per output language, with separate audio tracks per speaker muxed into the same input direction. Mixing multiple target languages into one session is unsupported — each session targets a single output language via audio.output.language. Mixing speakers' audio into one input stream is supported (the model translates whoever is speaking), but you typically want per-speaker tracks so you can label and route the output cleanly.
┌───── session A: out = "en" ─────► Listener A hears English
Speaker 1 (fr) ──┤
└───── session B: out = "es" ─────► Listener B hears Spanish
┌─── (same two sessions reused) ────►
Speaker 2 (ar) ──┤
└─── (same two sessions reused) ────►
Why one session per output language:
- `audio.output.language` is set once and applies to every translated chunk that session emits. To get two output languages from the same source audio, you need two parallel sessions, each receiving the same input audio frames.
- Target-language transcripts (`session.output_transcript.delta`) are emitted in the session's configured target language; a single session cannot produce multilingual transcripts.

Scaling math. For N speakers and M target languages, you may need up to N × M sessions in the worst case (one per speaker and target-language pair), but in practice you collapse along the speaker axis: M sessions total, each fed the mixed input audio from all speakers, each producing one language output. Add per-speaker source-language transcript routing if you want to label who said what in the captions UI.
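The scaling math above reduces to a one-line calculation. A small sketch, where `sessions_needed` and its `per_speaker` flag are hypothetical names introduced for illustration:

```python
def sessions_needed(num_speakers: int, target_languages: set, per_speaker: bool = False) -> int:
    """Session count: M when speakers share mixed input, N x M when kept separate."""
    m = len(target_languages)
    return num_speakers * m if per_speaker else m


languages = {"en", "es", "fr"}
print(sessions_needed(4, languages))                    # 3
print(sessions_needed(4, languages, per_speaker=True))  # 12
```

Since the session count multiplies cost directly, the collapsed form is usually the one to start with.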
Audio fan-out. Your bridge owns the fan-out: read user audio once from the transport (Twilio, WebRTC, meeting bot), base64-encode it once, and session.input_audio_buffer.append it to each target-language WebSocket in parallel. Don't open one session and pray.
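The fan-out described above can be sketched with `asyncio`: encode the chunk once, then send the same message to every target-language session concurrently. Assumptions are labeled in the code: the payload field name `audio` is borrowed from the conversational endpoint, and the `send` callables stand in for real WebSocket connections.

```python
import asyncio
import base64
import json
from typing import Awaitable, Callable, Dict, List

Send = Callable[[str], Awaitable[None]]  # stand-in for a WebSocket's send coroutine


async def fan_out(pcm: bytes, sessions: Dict[str, Send]) -> None:
    """Encode one PCM16 chunk once and append it to every language session."""
    message = json.dumps({
        "type": "session.input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),  # assumption: field is "audio"
    })
    await asyncio.gather(*(send(message) for send in sessions.values()))


def recorder(log: List[str]) -> Send:
    """Fake sender that records messages instead of using the network."""
    async def send(message: str) -> None:
        log.append(message)
    return send


en_log: List[str] = []
es_log: List[str] = []
asyncio.run(fan_out(b"\x00\x00" * 480, {"en": recorder(en_log), "es": recorder(es_log)}))
print(len(en_log), len(es_log))  # 1 1
```

In a real bridge, each entry in `sessions` would be the send method of an open connection to the translation endpoint, one per output language.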
- Input language list (70+): [UNCONFIRMED], not on either source page fetched.
- Output language list (13): [UNCONFIRMED], not on either source page fetched. The translation guide example uses the ISO-639-1 code "es" (Spanish); other codes follow the same standard.

Always use ISO-639-1 two-letter codes (en, fr, es, ar, zh, …) for `audio.output.language` until the docs publish the canonical list.
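A guard on the language code can catch obvious mistakes (such as "spanish" or "ES-mx") before a session is configured. This is a sketch with loud assumptions: the nested JSON shape is inferred from the dotted name `audio.output.language` and is not the confirmed `session.update` schema, and the check validates only the two-letter shape, not membership in the 13 supported output languages.

```python
import json
import re

TWO_LETTER = re.compile(r"[a-z]{2}")  # shape check only, not a supported-language list


def session_update(target_language: str) -> str:
    """Build a session.update message for one output language."""
    if not TWO_LETTER.fullmatch(target_language):
        raise ValueError(f"expected an ISO-639-1 two-letter code, got {target_language!r}")
    return json.dumps({
        "type": "session.update",
        # Assumption: nesting inferred from the path audio.output.language.
        "session": {"audio": {"output": {"language": target_language}}},
    })


print(session_update("es"))
```

See references/session-config.md for the field-by-field shape before relying on this structure.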
→ see references/languages.md for known input/output language details and pair-selection tips.
The translation endpoint accepts WebSocket or WebRTC. Your choice depends on who owns the audio path.
session.input_audio_buffer.append.
session.output_audio.delta back to the appropriate Twilio leg.
→ see examples/twilio-interpreter.py (to be added)

POST /v1/realtime/translations/client_secrets and the browser SDP-exchanges with POST /v1/realtime/translations/calls.
→ see examples/web-conference.tsx (to be added)

whisper-1 and translate via Chat Completions, then synthesize with the TTS API.
→ see examples/whatsapp-translate.py (to be added)

→ see examples/meeting-translator.py (to be added)
Translation is inherently higher-latency than a conversational response: the model must hear enough of an utterance to translate coherently, then synthesize target-language speech. Expect ~1–2 seconds end-to-end for short utterances, more for long ones. Do not target the <500ms time-to-first-audio numbers that apply to conversational gpt-realtime-2.
Practical implications:
| Property | Value |
|---|---|
| Pricing | $0.034 / minute of audio duration |
| Context window | 16,000 tokens |
| Max output | 2,000 tokens |
| Billing unit | Audio duration (not tokens) — [UNCONFIRMED] whether this counts input minutes only, output minutes only, or both. Treat as wall-clock session minutes per session for conservative estimates. |
| Rate limits | [UNCONFIRMED] — see OpenAI platform docs for current per-org limits |
Full worked examples (5-min support call, 1-hour 4-speaker / 3-language conference, 24/7 hotline) → references/pricing-limits.md.
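The conservative estimate suggested in the table (wall-clock minutes per session) is simple arithmetic. The sketch below applies the $0.034/minute rate to the three scenarios named above; the session counts and the 30-day month are illustrative assumptions, and the results are not the figures from references/pricing-limits.md.

```python
from decimal import Decimal

PRICE_PER_MINUTE = Decimal("0.034")  # USD per minute of audio, from the table


def estimate_cost(minutes: int, sessions: int = 1) -> Decimal:
    """Conservative estimate: every session is billed for its full wall-clock time."""
    return PRICE_PER_MINUTE * minutes * sessions


# Assumed session counts: one direction for the call, one session per
# target language for the conference, one always-on session for the hotline.
print(f"{estimate_cost(5):.2f}")                      # 0.17
print(f"{estimate_cost(60, sessions=3):.2f}")         # 6.12
print(f"{estimate_cost(30 * 24 * 60):.2f}")           # 1468.80
```

Bidirectional interpretation doubles the session count, so a two-way 5-minute call would come to twice the first figure under the same assumption.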
- Sending `input_audio_buffer.append` (no prefix) to `/v1/realtime/translations` results in silent drops. Always use `session.input_audio_buffer.append` here.
- See `## Multi-speaker architecture`.
- [UNCONFIRMED] whether audio/pcmu is even accepted on this endpoint; the docs only showed PCM16.
- `response.create`-driven turn control. Translation is continuous-stream. There is no per-turn `response.create` in the documented vocabulary. Drive the session via VAD.
- Do not rely on `session.created`, `session.updated`, or `error` until you have probed and observed them on a live session.

Long-form references (this dispatch):
- references/languages.md: input / output language details, pair-selection tips, code-switching notes.
- references/session-config.md: exact session.update JSON shape for the translation endpoint, field-by-field reference, comparison with conversational sessions.
- references/pricing-limits.md: full pricing table, 3 worked cost examples (5-min support call, 1-hour multi-language conference, 24/7 multilingual hotline), context limits.

Examples (to be added):
- examples/twilio-interpreter.py: Twilio Media Streams ↔ OpenAI translation bridge with two parallel sessions for bidirectional interpretation.
- examples/web-conference.tsx: browser WebRTC client for one-speaker, multiple-listener-language conference.
- examples/whatsapp-translate.py: WhatsApp Calls translator via Business Solution Provider bridge.
- examples/meeting-translator.py: Recall.ai meeting-bot integration with per-attendee target language.

Scripts (to be added):
- scripts/probe-events.py: live event-vocabulary probe to resolve [UNCONFIRMED] items (session lifecycle, .done events, voice config).

Sample files (to be added):
Canonical source for every fact in this skill: /Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md.
npx claudepluginhub generovo/claude-voice-skills --plugin claude-voice-skills