From claude-voice-skills
Use when building conversational voice agents (voicebots, callbots, vocal assistants) that need reasoning and tool calling with OpenAI's GPT-Realtime-2 model — including telephony (SIP/Twilio), WhatsApp calls, web (WebRTC), mobile, and meeting bots. Covers session config, audio formats, function calling in parallel, preambles, interruption handling, reasoning levels, and the WebSocket/WebRTC event protocol.
Slash command: /claude-voice-skills:voice-agent-realtime
- examples/meeting-bot.py
- examples/nextjs-route.ts
- examples/twilio-bridge.py
- examples/webrtc-client.tsx
- examples/whatsapp-call.py
- references/events-websocket.md
- references/preambles-interruptions.md
- references/pricing-limits.md
- references/tool-calling.md
- scripts/eval-prompt.py
- scripts/evals/adversarial.jsonl
- scripts/test-latency.py
- scripts/test_eval_prompt_unit.py
- scripts/test_latency_unit.py
- templates/system-prompt.md

This skill is the operational playbook for building production voice agents on OpenAI's gpt-realtime-2 model (announced 2026-05-07). Every event name, JSON field, endpoint URL, and pricing number in this skill is traceable to /Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md, which itself is sourced from the official OpenAI docs and a live API probe. Items the research file flagged [UNCONFIRMED] carry the same tag here; do not silently treat them as proven.
- gpt-realtime-2).
- voice-translate-live (powered by gpt-realtime-translate, $0.034/min, no tools, dedicated /v1/realtime/translations endpoint).
- voice-transcribe-stream (powered by gpt-realtime-whisper, $0.017/min, transcription-only).
- whisper-1 transcription API or the Chat Completions API with audio modality. The Realtime API is overkill (and more expensive) for non-interactive workloads.

| Property | Value |
|---|---|
| Model ID | gpt-realtime-2 |
| Modalities in | Text, audio, image |
| Modalities out | Text, audio |
| Max context window | 128,000 tokens |
| Max output per response | 32,000 tokens |
| Text input | $4.00 / 1M tokens |
| Text cached input | $0.40 / 1M tokens |
| Text output | $24.00 / 1M tokens |
| Audio input | $32.00 / 1M tokens (1 token = 100 ms user audio) |
| Audio cached input | $0.40 / 1M tokens |
| Audio output | $64.00 / 1M tokens (1 token = 50 ms assistant audio) |
| Image input | $5.00 / 1M tokens ($0.50 cached) |
| Reasoning levels | minimal, low, medium, high, xhigh — default low |
| Voices | alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar |
| Audio formats | audio/pcm (PCM16, default 24 kHz), audio/pcmu (μ-law); other formats [UNCONFIRMED] |
| Knowledge cutoff | Sep 30, 2024 |
Full pricing worked examples and context-pressure math → see references/pricing-limits.md.
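To make the audio rates in the table concrete, the per-minute cost follows directly from the token durations. The sketch below is only an illustration of that arithmetic, not one of the worked examples in references/pricing-limits.md. The helper name `audio_cost_per_minute` is hypothetical, and the calculation ignores caching, text tokens, and accumulated context.

```python
# Rates and token durations copied from the model table above.
AUDIO_IN_USD_PER_M = 32.00    # $ per 1M audio input tokens
AUDIO_OUT_USD_PER_M = 64.00   # $ per 1M audio output tokens
MS_PER_INPUT_TOKEN = 100      # 1 token = 100 ms of user audio
MS_PER_OUTPUT_TOKEN = 50      # 1 token = 50 ms of assistant audio


def audio_cost_per_minute(ms_per_token: int, usd_per_million: float) -> float:
    """Cost in USD of one minute of audio (hypothetical helper)."""
    tokens_per_minute = 60_000 / ms_per_token
    return tokens_per_minute * usd_per_million / 1_000_000


user_minute = audio_cost_per_minute(MS_PER_INPUT_TOKEN, AUDIO_IN_USD_PER_M)
agent_minute = audio_cost_per_minute(MS_PER_OUTPUT_TOKEN, AUDIO_OUT_USD_PER_M)
print(f"user audio:  ${user_minute:.4f}/min")
print(f"agent audio: ${agent_minute:.4f}/min")
```

Under these assumptions a minute of user speech is 600 tokens (about $0.0192) and a minute of assistant speech is 1,200 tokens (about $0.0768).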
| Transport | When to use | Latency profile | Channel fit | Endpoint |
|---|---|---|---|---|
| WebRTC | Browser or mobile clients where the device itself owns the microphone and speaker. | Lowest (~hundreds of ms end-to-end). | Web, native iOS/Android. | SDP offer/answer via POST https://api.openai.com/v1/realtime/calls; ephemeral key from POST https://api.openai.com/v1/realtime/client_secrets; data channel name oai-events. |
| WebSocket | Server-side bridges where YOU sit on the audio path (Twilio Media Streams, WhatsApp BSP, meeting-bot recording bridge). | Slightly higher than WebRTC (extra hop through your server). | SIP/telephony, WhatsApp Calls, meeting bots, internal services that need to inspect or log audio. | wss://api.openai.com/v1/realtime?model=gpt-realtime-2 with Authorization: Bearer ${OPENAI_API_KEY}. |
| SIP | Direct PSTN integration. OpenAI accepts inbound SIP and dispatches a webhook to your server, which then accepts/rejects/refers the call. | Server-side; latency similar to WebSocket. | PSTN inbound/outbound. | Inbound SIP URI: sip:[email protected];transport=tls. REST call control: POST /v1/realtime/calls/$CALL_ID/{accept,reject,refer,hangup}. Webhook: realtime.call.incoming. |
[UNCONFIRMED] whether OpenAI-Beta: realtime=v1 is still required — the 2026 docs do not show it; older docs did. Treat as not required.
The 6 protocol steps every voice agent runs through, with the exact event names from the conversational endpoint. Examples are minimal — see references/events-websocket.md for the full payload field catalog.
Server-side (WebSocket):
WS: wss://api.openai.com/v1/realtime?model=gpt-realtime-2
Header: Authorization: Bearer ${OPENAI_API_KEY}
Header: OpenAI-Safety-Identifier: <hashed-user-id> (optional, recommended)
Browser (WebRTC): server first mints an ephemeral key via POST /v1/realtime/client_secrets, the browser then SDP-exchanges with POST /v1/realtime/calls using Authorization: Bearer <ephemeral_key> and Content-Type: application/sdp. The data channel is named oai-events.
On success the server emits session.created with the resolved session config.
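As a minimal sketch of the server-side connection parameters described above, the following builds the URL and headers without opening a socket. The helper name `realtime_ws_params` is hypothetical, and using a SHA-256 hex digest as the "hashed user id" is an assumption; the source only says the identifier should be hashed.

```python
import hashlib
import os
from typing import Dict, Optional, Tuple

REALTIME_WS_URL = "wss://api.openai.com/v1/realtime"


def realtime_ws_params(
    model: str, api_key: str, user_id: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for the server-side WebSocket connection."""
    url = f"{REALTIME_WS_URL}?model={model}"
    headers = {"Authorization": f"Bearer {api_key}"}
    if user_id is not None:
        # Assumption: a SHA-256 hex digest is an acceptable hashed user id.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        headers["OpenAI-Safety-Identifier"] = digest
    return url, headers


url, headers = realtime_ws_params(
    "gpt-realtime-2",
    os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
    user_id="user-42",
)
print(url)
```

The resulting pair can be handed to whichever WebSocket client you use; the keyword argument for extra headers differs between libraries and versions, so check yours before wiring it in.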
session.update
Sets model behavior, voice, tools, VAD, reasoning level, and instructions. It can be re-sent at any time mid-session.
{
  "type": "session.update",
  "session": {
    "type": "realtime",
    "model": "gpt-realtime-2",
    "output_modalities": ["audio"],
    "voice": "marin",
    "audio": {
      "input": {
        "format": { "type": "audio/pcm", "rate": 24000 },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.5,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500,
          "create_response": true,
          "interrupt_response": true
        }
      },
      "output": { "format": { "type": "audio/pcm" }, "voice": "marin" }
    },
    "instructions": "You are a phone support agent. Be concise. Before any tool call >200ms, briefly tell the user what you are doing (\"one moment, checking that\").",
    "tools": [/* see references/tool-calling.md */],
    "tool_choice": "auto",
    "reasoning": { "effort": "low" }
  }
}
[UNCONFIRMED] exact path for reasoning.effort — the prompting guide names the field but didn't show a full JSON example. The shape above is the most likely fit; probe before relying.
Server replies with session.updated.
input_audio_buffer.append
Send audio chunks as base64-encoded PCM16 (or μ-law if the format is audio/pcmu):
{ "type": "input_audio_buffer.append", "audio": "<base64 audio chunk>" }
Send roughly every 20–100 ms. With server VAD enabled, the server emits input_audio_buffer.speech_started, then input_audio_buffer.speech_stopped, then input_audio_buffer.committed when the turn ends. If you've disabled VAD, send input_audio_buffer.commit yourself.
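The chunking described above can be sketched as follows. This is an illustration under stated assumptions: mono PCM16 at the default 24 kHz rate, and a hypothetical helper name `append_events`. It only produces the JSON strings; sending them over the socket is left to your transport code.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000   # default rate for audio/pcm per the model table
BYTES_PER_SAMPLE = 2      # PCM16; mono is assumed


def append_events(pcm16: bytes, chunk_ms: int = 20) -> Iterator[str]:
    """Yield JSON-encoded input_audio_buffer.append events, one per chunk."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(append_events(one_second_of_silence, chunk_ms=20))
print(len(events))  # 50 chunks of 960 bytes each
```

A 20 ms chunk sits at the low end of the 20–100 ms range; larger chunks reduce message count at the cost of added buffering delay.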
response.create (often implicit)
When turn_detection.create_response is true, the server starts a response automatically on speech_stopped, with no response.create from the client. To trigger one manually:
{ "type": "response.create" }
You can also override session config for one turn:
{
  "type": "response.create",
  "response": {
    "output_modalities": ["audio"],
    "instructions": "Acknowledge briefly, then call the get_account tool."
  }
}
Server replies with response.created.
response.output_audio.delta / .done
< response.created
< response.output_item.added (e.g., a "message" with audio content)
< response.content_part.added
< response.output_audio.delta (base64 audio chunk — feed to your speaker)
< response.output_audio.delta
...
< response.output_audio_transcript.delta (assistant captions, optional but useful)
< response.output_audio.done
< response.output_audio_transcript.done
< response.content_part.done
< response.output_item.done
< response.done (full usage data here)
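One way to consume the event trace above is a small collector that decodes audio deltas and accumulates captions. This is a sketch, not the skill's reference implementation: the class name `ResponseCollector` is hypothetical, and the field names `delta` (on delta events) and `response.usage` (on response.done) are assumptions to verify against references/events-websocket.md. The `total_tokens` key in the sample data is purely illustrative.

```python
import base64
import json


class ResponseCollector:
    """Accumulates audio and captions for one response (hypothetical helper)."""

    def __init__(self) -> None:
        self.audio = bytearray()
        self.transcript = []
        self.usage = None
        self.finished = False

    def handle(self, raw_event: str) -> None:
        event = json.loads(raw_event)
        kind = event.get("type")
        if kind == "response.output_audio.delta":
            # Assumption: the base64 chunk is carried in a "delta" field.
            self.audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.output_audio_transcript.delta":
            self.transcript.append(event["delta"])
        elif kind == "response.done":
            # Assumption: usage data is nested under "response".
            self.usage = event.get("response", {}).get("usage")
            self.finished = True
        # Other event types in the trace are ignored in this sketch.


collector = ResponseCollector()
for raw in [
    json.dumps({"type": "response.created"}),
    json.dumps({"type": "response.output_audio.delta",
                "delta": base64.b64encode(b"\x00\x01").decode("ascii")}),
    json.dumps({"type": "response.output_audio_transcript.delta", "delta": "Hi "}),
    json.dumps({"type": "response.output_audio_transcript.delta", "delta": "there"}),
    json.dumps({"type": "response.done",
                "response": {"usage": {"total_tokens": 12}}}),
]:
    collector.handle(raw)

print("".join(collector.transcript))
```

In a live agent the decoded bytes would go to the speaker queue as they arrive instead of being buffered until the end.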
response.function_call_arguments.delta / .done / submit results
When the model decides to call a function:
< response.output_item.added (item.type: "function_call", item.call_id, item.name)
< response.function_call_arguments.delta (streaming JSON arguments)
< response.function_call_arguments.delta ...
< response.done (final arguments guaranteed in response.output[])
[UNCONFIRMED] — response.function_call_arguments.done is not enumerated in the conversations guide (research item #1). Final arguments are guaranteed on response.done; use that as the authoritative completion signal until a live probe confirms .done.
Your client then:
{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_aaa",
    "output": "{\"temp_c\":21}"
  }
}
…and finally { "type": "response.create" } to trigger the spoken answer.
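Because response.done is the authoritative completion signal here, a client can read the finished calls from response.output[] and build the reply events from them. The sketch below assumes each function_call item carries its arguments as a JSON string in an `arguments` field; that field name, the sample `get_weather` tool, and both helper names are assumptions for illustration.

```python
import json
from typing import Any, Dict, List, Tuple


def pending_function_calls(done_event: Dict[str, Any]) -> List[Tuple[str, str, dict]]:
    """Return (call_id, name, arguments) for each function_call in response.output[]."""
    calls = []
    for item in done_event.get("response", {}).get("output", []):
        if item.get("type") == "function_call":
            # Assumption: "arguments" is a JSON-encoded string on the item.
            args = json.loads(item.get("arguments") or "{}")
            calls.append((item["call_id"], item["name"], args))
    return calls


def function_output_event(call_id: str, result: dict) -> Dict[str, Any]:
    """Build the conversation.item.create event that submits a tool result."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    }


done_event = {
    "type": "response.done",
    "response": {"output": [{
        "type": "function_call",
        "call_id": "call_aaa",
        "name": "get_weather",          # hypothetical tool name
        "arguments": '{"city": "Paris"}',
    }]},
}
calls = pending_function_calls(done_event)
outgoing = [function_output_event(cid, {"temp_c": 21}) for cid, _, _ in calls]
outgoing.append({"type": "response.create"})  # triggers the spoken answer
```

Note that `output` is itself a JSON string, so the result is serialized twice by the time the event goes on the wire.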
gpt-realtime-2 supports plain function tools and MCP tools. Tools are declared on session.tools[] with { type, name, description, parameters }. tool_choice accepts "auto" or "required" (and [UNCONFIRMED] "none").
Parallel tool calls are explicitly supported on gpt-realtime-2 — a single turn can emit multiple function_call items whose response.function_call_arguments.delta streams interleave by call_id. Run them in parallel, submit all function_call_output items, then send one response.create for the consolidated spoken answer.
MCP servers can be declared as tools too (type: "mcp"); OpenAI's service then talks to your MCP server directly and you receive lifecycle events (mcp_list_tools.in_progress, response.mcp_call.in_progress, etc.).
→ see references/tool-calling.md
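The "run in parallel, submit all outputs, then one response.create" pattern can be sketched with asyncio. Everything named here is hypothetical: the two tool coroutines, the `TOOLS` registry, and `run_tools_in_parallel`. The sketch also assumes tool arguments have already been parsed into dictionaries.

```python
import asyncio
import json
from typing import Any, Dict, List, Tuple


async def get_weather(city: str) -> dict:        # hypothetical tool
    await asyncio.sleep(0.01)                     # stands in for real I/O
    return {"city": city, "temp_c": 21}


async def get_account(account_id: str) -> dict:  # hypothetical tool
    await asyncio.sleep(0.01)
    return {"account_id": account_id, "status": "active"}


TOOLS = {"get_weather": get_weather, "get_account": get_account}


async def run_tools_in_parallel(
    calls: List[Tuple[str, str, dict]]
) -> List[Dict[str, Any]]:
    """calls holds (call_id, name, args); returns the events to send, in order."""
    # gather() runs the coroutines concurrently and keeps the input order.
    results = await asyncio.gather(
        *(TOOLS[name](**args) for _, name, args in calls)
    )
    events: List[Dict[str, Any]] = [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        }
        for (call_id, _, _), result in zip(calls, results)
    ]
    events.append({"type": "response.create"})  # exactly one, after all outputs
    return events


events = asyncio.run(run_tools_in_parallel([
    ("call_aaa", "get_weather", {"city": "Paris"}),
    ("call_bbb", "get_account", {"account_id": "A1"}),
]))
print([e["type"] for e in events])
```

A production version would also want per-tool timeouts and error results, so that one slow or failing tool does not block the spoken answer.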
Preambles are short utterances ("let me check that for you") the model emits between response.created and response.function_call_arguments.delta. They're produced naturally when your instructions say so — e.g., "Before any tool call >200ms, briefly tell the user what you are doing." They are normal audio (response.output_audio.delta), just routed into a content part before the function-call item.
Interruption (barge-in): with server_vad + interrupt_response: true, when the user starts speaking again the server auto-cancels the in-flight response. Your client should also act synchronously on input_audio_buffer.speech_started: drain your local audio buffer, send response.cancel, and send conversation.item.truncate with the audio_end_ms of what the user actually heard (not what you received from the server).
→ see references/preambles-interruptions.md
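The barge-in bookkeeping can be sketched as below: convert the bytes actually played into milliseconds, then build the cancel and truncate events. It assumes mono PCM16 at 24 kHz on the playback side. The `item_id` and `content_index` fields on conversation.item.truncate are assumptions, since the text above names only `audio_end_ms`; the helper names and the sample item ID are hypothetical.

```python
from typing import Any, Dict, List

SAMPLE_RATE_HZ = 24_000   # assumed playback rate (audio/pcm default)
BYTES_PER_SAMPLE = 2      # PCM16; mono is assumed


def played_ms(bytes_played: int) -> int:
    """Milliseconds of audio the user actually heard."""
    return bytes_played * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def barge_in_events(
    item_id: str, bytes_played: int, content_index: int = 0
) -> List[Dict[str, Any]]:
    """Events to send on input_audio_buffer.speech_started.

    The caller is expected to clear its local playback queue first.
    """
    return [
        {"type": "response.cancel"},
        {
            "type": "conversation.item.truncate",
            "item_id": item_id,              # assumption: field name
            "content_index": content_index,  # assumption: field name
            "audio_end_ms": played_ms(bytes_played),
        },
    ]


# 72,000 bytes played out of a longer response is 1.5 s of heard audio.
events = barge_in_events("item_123", bytes_played=72_000)
print(events[1]["audio_end_ms"])  # 1500
```

The important design point is that `bytes_played` comes from the speaker pipeline, not from the count of bytes received from the server.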
gpt-realtime-2 supports 5 reasoning levels, configured via session.reasoning.effort ([UNCONFIRMED] exact JSON path, see step 2 above):
| Level | When to use |
|---|---|
| minimal | Greet/farewell flows, IVR-like simple routing. Fastest, cheapest. |
| low (default) | Standard conversational agents. Single-tool turns. Most production callbots. |
| medium | Multi-step requests with 2–3 tools. Light disambiguation. |
| high | Complex multi-tool reasoning (parallel tool selection, conditional flows, adversarial users trying to break your agent). |
| xhigh | Last-resort for hard adversarial / multi-constraint problems. Latency cost is significant. |
Default is low. Bump higher when the model fails multi-step reasoning evals; drop to minimal only for latency-critical single-purpose flows. Higher reasoning levels can slow down preambles, so re-run your eval set after changing the level.
- Your server mints an ephemeral key via POST /v1/realtime/client_secrets.
- The browser exchanges SDP via POST /v1/realtime/calls (Content-Type: application/sdp, Authorization: Bearer <ephemeral>).
- The data channel oai-events carries the JSON event stream.

→ see examples/webrtc-client.tsx (to be added)
- gpt-realtime-2 documents audio/pcmu for μ-law input/output, so you can pass through unchanged, or resample to PCM16 24 kHz on the bridge for higher fidelity.
- <Dial><Sip> at sip:[email protected];transport=tls and use the realtime.call.incoming webhook + REST /accept to configure the session.
- [UNCONFIRMED] whether 8 kHz μ-law works without resampling on the new endpoint (research item #8). Test before committing to a wire format.

→ see examples/twilio-bridge.py (to be added)
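If you take the resampling route on the bridge, the conversion can be sketched in pure Python as below. It decodes G.711 μ-law bytes to PCM16 and upsamples 8 kHz to 24 kHz by linear interpolation. Assumptions: the telephony leg delivers 8 kHz mono μ-law, the model side expects little-endian PCM16, and plain interpolation (no filtering) is good enough for a first test. The function names are hypothetical, and this is not the content of examples/twilio-bridge.py.

```python
import struct
from typing import List


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def upsample_3x(samples: List[int]) -> List[int]:
    """8 kHz -> 24 kHz by linear interpolation (no anti-imaging filter)."""
    out: List[int] = []
    for i, current in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else current
        step = (nxt - current) / 3
        out.extend(int(round(current + step * k)) for k in range(3))
    return out


def ulaw8k_to_pcm16_24k(ulaw: bytes) -> bytes:
    """Convert 8 kHz mu-law bytes to 24 kHz little-endian PCM16 bytes."""
    samples = upsample_3x([ulaw_byte_to_pcm16(b) for b in ulaw])
    return struct.pack(f"<{len(samples)}h", *samples)


frame = bytes([0xFF] * 160)               # 20 ms of mu-law silence at 8 kHz
pcm = ulaw8k_to_pcm16_24k(frame)
print(len(pcm))                           # 160 samples * 3 * 2 bytes = 960
```

Pure-Python conversion is fine for testing the wire format; for production call volumes a compiled resampler is the more likely choice.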
→ see examples/whatsapp-call.py (to be added)
webrtc-ios, webrtc-android) connect directly to the OpenAI WebRTC endpoint after your server mints an ephemeral key.

response.output_audio.delta stream.

→ see examples/meeting-bot.py (to be added)
- audio/pcmu if you keep μ-law; resample on your bridge if you go to PCM16.
- response.create per parallel tool output. Don't. Submit all function_call_output items first, then send a single response.create for the consolidated spoken response.
- response.cancel stops the server, but you still have queued audio in your speaker pipeline. Clear it synchronously on input_audio_buffer.speech_started.
- [UNCONFIRMED] items as proven. Especially response.function_call_arguments.done, the exact JSON path for reasoning.effort, and audio format strings beyond audio/pcm/audio/pcmu. Probe a live session before depending on them.
- threshold: 0.5 will spuriously trigger on background hum at telephony quality. Bump to 0.6–0.7 or switch to semantic_vad with eagerness: "low".
- prompt for very long sessions.
- conversation.item.truncate after barge-in. Without it, the model thinks it spoke a sentence the user never heard, and may not repeat critical info.
- OpenAI-Beta: realtime=v1. [UNCONFIRMED] whether still accepted/required. 2026 docs don't show it. Leave it out unless a 400 forces you to add it.

The OpenAI announcement cites Zillow taking a real estate voice agent from 69% to 95% task accuracy on adversarial eval prompts after migrating to gpt-realtime-2. The headline number isn't the point; the methodology is:
- prompt field, repeat.

→ see scripts/eval-prompt.py (to be added)
Long-form references:
- references/events-websocket.md: exhaustive client→server and server→client event catalog with payload field names.
- references/tool-calling.md: tool definition shape, parallel calls, function_call_output submission, MCP.
- references/preambles-interruptions.md: preamble prompt patterns, VAD modes, barge-in flow, common bugs.
- references/pricing-limits.md: full pricing table, 3 worked cost examples, 128K context math, pruning patterns.

Examples (to be added):
- examples/webrtc-client.tsx: browser WebRTC connection + ephemeral-key mint.
- examples/twilio-bridge.py: Twilio Media Streams ↔ OpenAI WebSocket bridge.
- examples/whatsapp-call.py: WhatsApp Calls via BSP bridge.
- examples/meeting-bot.py: Recall.ai meeting-bot integration.

Scripts (to be added):
- scripts/eval-prompt.py: adversarial eval harness for voice agents.

Templates (to be added):
Canonical source for every fact in this skill: /Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md.
npx claudepluginhub generovo/claude-voice-skills --plugin claude-voice-skills