From claude-voice-skills
Use when building streaming speech-to-text with OpenAI's GPT-Realtime-Whisper model — live captions, meeting transcription, voice notes, low-latency transcripts. Covers progressive deltas, latency tuning profiles (0.4s / 0.8-1.2s / 1.5-2s), and the streaming protocol. Do NOT use for conversational agents (see voice-agent-realtime) or translation (see voice-translate-live).
Slash command: `/claude-voice-skills:voice-transcribe-stream`
Files in this skill:

- `examples/live-captions.tsx`
- `examples/meeting-minutes.py`
- `examples/twilio-transcribe.py`
- `examples/whatsapp-voice-notes.py`
- `references/latency-tuning.md`
- `references/pricing-limits.md`
- `references/streaming-deltas.md`
- `scripts/samples/audio-pairs.jsonl`
- `scripts/test-latency-vs-quality.py`
- `scripts/test_latency_vs_quality_unit.py`

This skill is the operational playbook for building real-time streaming speech-to-text on OpenAI's `gpt-realtime-whisper` model (announced 2026-05-07). Every event name, JSON field, endpoint URL, and pricing number traces back to `/Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md`, itself sourced from the official OpenAI docs and a live API probe. Items the research file flagged [UNCONFIRMED] carry the same tag here; do not silently treat them as proven.
READ THIS BEFORE WIRING THE UI. The transcription endpoint emits progressive deltas that may be revised before being finalized. Render `conversation.item.input_audio_transcription.delta` as speculative (italic / grayed) and only commit on `.completed`. See `## Progressive deltas — the core UX pattern`.
gpt-realtime-2 audio-in / audio-out pipeline.

Do not use this skill for:

- Conversational agents: use `voice-agent-realtime` (powered by `gpt-realtime-2`). Whisper streams text only: no audio output, no reasoning, no tool calls.
- Translation: use `voice-translate-live` (powered by `gpt-realtime-translate`, $0.034/min). Whisper transcribes in the source language only.
- Async workloads: use the `whisper-1` or `gpt-4o-transcribe` REST endpoint. The realtime API adds infrastructure cost for no benefit on async workloads.
- Speaker diarization: see `gpt-4o-transcribe-diarize` (separate model in the live model list). [UNCONFIRMED] whether `gpt-realtime-whisper` supports diarization; the model card and transcription guide did not mention it, so assume not until probed.

| Property | Value |
|---|---|
| Model ID | gpt-realtime-whisper |
| Modalities in | Audio, text (text input for prompt / context hinting) |
| Modalities out | Text only (no audio output) |
| Tool calling | Not applicable |
| Reasoning levels | Not applicable |
| Pricing | $0.017 / minute of audio duration |
| Max context window | 16,000 tokens |
| Max output per response | 2,000 tokens |
| Audio format | audio/pcm PCM16 at 24 kHz (shown in docs example); [UNCONFIRMED] whether audio/pcmu (μ-law for telephony) is accepted |
| Sample rate | 24,000 Hz in the example; [UNCONFIRMED] whether 16/8 kHz are accepted |
| Knowledge cutoff | Sep 30, 2024 |
| Latency target (docs) | 0.4 – 3.0 seconds depending on use case |
Full pricing worked examples → references/pricing-limits.md.
Per the research file, the WebSocket URL for the transcription endpoint is [UNCONFIRMED]; two likely shapes need to be probed:

    # Option A (likely; matches conversational base path)
    wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper
    # Option B (model card lists this REST path; may also be the WS path)
    wss://api.openai.com/v1/realtime/transcription_sessions?model=gpt-realtime-whisper

Header: `Authorization: Bearer ${OPENAI_API_KEY}`
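When probing, it helps to generate both candidate URLs from one place so the probe script and the production client cannot drift apart. A minimal sketch follows; the helper names `candidate_urls` and `auth_headers` are hypothetical, and neither URL shape is verified here (both remain [UNCONFIRMED]).

```python
from urllib.parse import urlencode

MODEL = "gpt-realtime-whisper"

# Both paths are [UNCONFIRMED] in the research file; this list only
# enumerates them for a probe, it does not assert that either works.
CANDIDATE_PATHS = [
    "/v1/realtime",                         # Option A
    "/v1/realtime/transcription_sessions",  # Option B
]


def candidate_urls(model: str = MODEL) -> list:
    """Return the WebSocket URLs to try, in order of likelihood."""
    query = urlencode({"model": model})
    return [f"wss://api.openai.com{path}?{query}" for path in CANDIDATE_PATHS]


def auth_headers(api_key: str) -> dict:
    """Build the Authorization header shown above."""
    return {"Authorization": f"Bearer {api_key}"}
```

Try the URLs in order and record which one accepts the connection before hard-coding either.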
The discriminator is session.type = "transcription" inside the first session.update. That is what tells the realtime backend you want transcription semantics rather than a conversational response loop.
Minimal session config (verbatim shape from the transcription guide):

    {
      "type": "session.update",
      "session": {
        "type": "transcription",
        "audio": {
          "input": {
            "format": { "type": "audio/pcm", "rate": 24000 },
            "transcription": {
              "model": "gpt-realtime-whisper",
              "language": "en"
            },
            "turn_detection": { "type": "server_vad" }
          }
        },
        "include": ["item.input_audio_transcription.logprobs"]
      }
    }
Key fields:
- `session.type = "transcription"`: switches the session into transcription mode (no response loop, no audio output).
- `audio.input.transcription.model = "gpt-realtime-whisper"`: selects the model.
- `audio.input.transcription.language`: ISO-639-1 hint. Omit to let the model auto-detect; supply when known to reduce latency and improve quality.
- `audio.input.turn_detection`: `server_vad` chunks audio at utterance boundaries automatically; manual mode requires sending `input_audio_buffer.commit` yourself.
- `include = ["item.input_audio_transcription.logprobs"]`: opt-in per-token log probabilities for confidence-aware UI rendering.

Audio chunks are then sent via the unprefixed `input_audio_buffer.append` event (matching the conversational endpoint vocabulary, NOT the `session.`-prefixed translation endpoint):

    { "type": "input_audio_buffer.append", "audio": "<base64 PCM16 24kHz>" }
For manual mode (no VAD), follow up with input_audio_buffer.commit to flush.
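The three client-to-server messages described above can be built as plain dictionaries and serialized with `json.dumps` before sending. This is a sketch that mirrors the JSON shapes shown in this section; the function names are hypothetical, and the payloads have not been validated against a live session here.

```python
import base64
from typing import Optional


def build_session_update(language: Optional[str] = "en",
                         want_logprobs: bool = True) -> dict:
    """Build the first session.update that selects transcription mode."""
    transcription = {"model": "gpt-realtime-whisper"}
    if language is not None:
        # Omitting the key leaves language detection to the model.
        transcription["language"] = language
    session = {
        "type": "transcription",
        "audio": {
            "input": {
                "format": {"type": "audio/pcm", "rate": 24000},
                "transcription": transcription,
                "turn_detection": {"type": "server_vad"},
            }
        },
    }
    if want_logprobs:
        session["include"] = ["item.input_audio_transcription.logprobs"]
    return {"type": "session.update", "session": session}


def build_audio_append(pcm16: bytes) -> dict:
    """Wrap raw PCM16 bytes in the unprefixed append event."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16).decode("ascii"),
    }


def build_commit() -> dict:
    """Manual mode only: flush the buffered audio."""
    return {"type": "input_audio_buffer.commit"}
```

Keeping the builders free of network code makes them easy to unit-test and to reuse across the browser-proxy, Twilio, and meeting-bot adapters.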
## Progressive deltas — the core UX pattern

The transcription endpoint emits two transcription events flagged in the research file:

- `conversation.item.input_audio_transcription.delta`: `event.delta` carries newly available transcript text. Identifies the in-progress utterance via `event.item_id`.
- `conversation.item.input_audio_transcription.completed`: final transcript for that `item_id`. The model is committing the text; it will not revise this segment further.

Rendering rule of thumb.

- As `.delta` events stream in, append to a per-`item_id` buffer and render it as speculative: italic, lower opacity, no line wrap commit.
- On `.completed`, replace the speculative buffer for that `item_id` with the final text, render it in the regular style, and scroll-anchor it above the next live region.
- Deltas may revise earlier text within the same `item_id`: the model can correct a word it heard wrong now that it has more context. The safe pattern is: keep the speculative region as a single replaceable block per `item_id` (don't append-only) and re-render the full buffer each delta. On `.completed`, the segment is locked.

[UNCONFIRMED]: the research file does not explicitly document delta retraction / replacement semantics for `gpt-realtime-whisper`. Streaming Whisper variants typically allow mid-segment correction; assume revisions happen and code defensively. Probe on a live session to confirm.
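The rendering rule above can be sketched as a small state holder that keeps one replaceable speculative block per `item_id` and locks it on `.completed`. The class name `CaptionBuffer` is hypothetical, and reading the final text from a `transcript` field on the `.completed` event is an assumption not stated in this skill; check it against a live payload.

```python
from dataclasses import dataclass, field

DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


@dataclass
class CaptionBuffer:
    final: dict = field(default_factory=dict)        # item_id -> committed text
    speculative: dict = field(default_factory=dict)  # item_id -> in-progress text

    def handle(self, event: dict) -> None:
        kind = event.get("type")
        item_id = event.get("item_id")
        if kind == DELTA:
            if item_id in self.final:
                return  # segment already locked; ignore stray deltas
            # Append per the rule of thumb. If a probe shows deltas carry
            # replacement text, assign instead of concatenating.
            self.speculative[item_id] = (
                self.speculative.get(item_id, "") + event.get("delta", "")
            )
        elif kind == COMPLETED:
            self.speculative.pop(item_id, None)
            # "transcript" is an assumed field name.
            self.final[item_id] = event.get("transcript", "")
        # Any other event type (for example response.*) is ignored.

    def render(self) -> list:
        """Full re-render: finalized segments first, speculative ones below."""
        lines = [("final", text) for text in self.final.values()]
        lines += [("speculative", text) for text in self.speculative.values()]
        return lines
```

Calling `render()` after every event and redrawing the whole speculative region avoids the doubled-text failure that an append-only DOM would show if revisions do occur.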
Full deep-dive on event flow, payload shapes, UX patterns (append-only / replace-last / finalized-above + speculative-below) → references/streaming-deltas.md.
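If the session opts in to `item.input_audio_transcription.logprobs`, per-token log probabilities can drive confidence-aware styling. The sketch below assumes each logprob entry is a dictionary with `token` and `logprob` keys; that payload shape is an assumption and is not documented in this skill. The 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
import math


def token_confidences(logprobs: list) -> list:
    """Convert natural-log probabilities to (token, probability) pairs."""
    return [(entry["token"], math.exp(entry["logprob"])) for entry in logprobs]


def low_confidence_tokens(logprobs: list, threshold: float = 0.5) -> list:
    """Return tokens whose probability falls below the threshold."""
    return [tok for tok, p in token_confidences(logprobs) if p < threshold]
```

A UI could underline the returned tokens so a reviewer knows where to listen again.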
The transcription guide cites 0.4 – 3.0 s end-to-end latency depending on configuration. Three operating points are useful in practice:
| Profile | Target end-to-end | Use cases |
|---|---|---|
| Ultra-low | ~0.4 s | Live captions for streamed video / audio, IDE voice command, real-time accessibility overlays |
| Balanced | 0.8 – 1.2 s | Meetings, call-center, customer support — default for most cases |
| Quality-max | 1.5 – 2 s | Medical / legal dictation, broadcast captioning, anything where accuracy outweighs speed |
What controls latency:
- Chunk size per `input_audio_buffer.append`. [UNCONFIRMED] whether the transcription endpoint uses the unprefixed `input_audio_buffer.append` (research shows yes; it matches the conversational endpoint) or the `session.`-prefixed translation variant. Smaller chunks (e.g. 20 ms) cut buffer latency at the cost of more frames per second on the wire. 100 ms is a balanced default.
- `audio.input.turn_detection` with `server_vad` chunks at utterance boundaries. Lower `silence_duration_ms` and `threshold` produce earlier commits (lower latency, more false splits); raising them improves coherence (higher latency).
- Setting `audio.input.transcription.language` explicitly skips the model's auto-detect step, shaving a small but measurable amount off first-delta latency.
- Network round trip to `api.openai.com`. See `references/latency-tuning.md` for the network-budget breakdown.

Full per-profile config and decision tree → `references/latency-tuning.md`.
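The chunk-size trade-off is easy to quantify for PCM16 mono at 24 kHz (2 bytes per sample). The following sketch computes bytes per chunk, messages per second, and the base64 payload length, and splits a buffer into fixed-duration chunks. The helper names are hypothetical, and mono audio is an assumption.

```python
import math

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # PCM16, mono assumed


def chunk_bytes(chunk_ms: int) -> int:
    """Raw PCM bytes in one chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000


def messages_per_second(chunk_ms: int) -> float:
    """How many append events go over the wire each second."""
    return 1000 / chunk_ms


def base64_length(raw_bytes: int) -> int:
    """Length of the base64 string carried in the 'audio' field."""
    return 4 * math.ceil(raw_bytes / 3)


def split_chunks(pcm: bytes, chunk_ms: int):
    """Yield consecutive chunks; the last one may be shorter."""
    size = chunk_bytes(chunk_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

At 20 ms a chunk is 960 bytes and 50 messages go out per second; at 100 ms it is 4,800 bytes and 10 messages per second.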
## Long-session strategy

The model has a 16K context window and 2K max output tokens. For multi-hour transcription (long meetings, all-day captioning, hotline shifts), a single session will eventually exhaust either limit.

Rotation pattern.

- `item_id`s do not carry over across sessions. In your UI / store, prepend a session-level prefix so transcript items remain unique across rotations.
- See `references/pricing-limits.md` for a worked rotation example.

[UNCONFIRMED] exact hard-cap behavior. The research file does not say what happens when the 16K context is exhausted: silent drop, error event, or session close. Implement rotation prophylactically rather than waiting for the failure mode to manifest.
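The session-prefix idea can be sketched as a small store that namespaces every `item_id` by the session it came from and signals when a time-based rotation is due. The class name, the `s1:`, `s2:` prefix format, and the 600-second default are all illustrative assumptions; a token-based trigger may suit better once the hard-cap behavior is probed.

```python
import itertools
import time


class RotatingTranscript:
    def __init__(self, max_session_seconds: float = 600.0, clock=time.monotonic):
        self._clock = clock              # injectable for tests
        self._max = max_session_seconds  # assumed rotation interval
        self._counter = itertools.count(1)
        self.segments = {}               # "<session>:<item_id>" -> final text
        self._start_session()

    def _start_session(self) -> None:
        self.session_prefix = f"s{next(self._counter)}"
        self._started = self._clock()

    def should_rotate(self) -> bool:
        """True once the current session has run for the configured time."""
        return self._clock() - self._started >= self._max

    def rotate(self) -> None:
        """Call after opening the replacement WebSocket session."""
        self._start_session()

    def key(self, item_id: str) -> str:
        return f"{self.session_prefix}:{item_id}"

    def commit(self, item_id: str, text: str) -> None:
        self.segments[self.key(item_id)] = text
```

Because keys carry the session prefix, an `item_id` that repeats after rotation cannot overwrite an earlier segment.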
The transcription endpoint accepts WebSocket; clients vary by where the audio originates.
- The browser captures microphone audio with `getUserMedia`, downsamples / encodes to PCM16 24 kHz in an `AudioWorkletNode`, base64-chunks, and sends via WebSocket to your server, which forwards to OpenAI.
- Ephemeral tokens (`client_secrets`) are [UNCONFIRMED]: the research file lists the conversational and translation mint endpoints but did not enumerate one for `transcription_sessions`. Server-proxy is the safe default.

→ see `examples/live-captions.tsx` (to be added)
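The encode step is the same arithmetic wherever it runs. Shown here in Python for a server that receives float samples, it is a sketch only: the 48 kHz capture rate is an assumption (browsers commonly use it but it is not guaranteed), and pair-averaging is a crude stand-in for a proper low-pass resampler.

```python
import struct


def downsample_2x(samples: list) -> list:
    """Halve the sample rate (e.g. 48 kHz -> 24 kHz) by averaging adjacent
    pairs. A trailing unpaired sample is dropped."""
    return [
        (samples[i] + samples[i + 1]) / 2
        for i in range(0, len(samples) - 1, 2)
    ]


def float_to_pcm16(samples: list) -> bytes:
    """Clamp floats to [-1.0, 1.0] and pack as little-endian signed 16-bit."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

The resulting bytes are what gets base64-encoded into `input_audio_buffer.append`.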
- Resample Twilio's 8 kHz μ-law frames to PCM16 24 kHz and forward them via `input_audio_buffer.append`.
- [UNCONFIRMED] whether `audio/pcmu` is accepted on this endpoint; if so, you can skip the resample and feed Twilio's native frames through. Until proven, resample.
- Each `.completed` event corresponds to a usable transcript segment for downstream CRM logging.

→ see `examples/twilio-transcribe.py` (to be added)
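Until `audio/pcmu` is proven, the telephony path needs a μ-law decode followed by an 8 kHz to 24 kHz upsample. The sketch below uses the standard G.711 μ-law expansion and a linear-interpolation 3x upsample, written in pure Python because the `audioop` module is no longer in the standard library as of Python 3.13. The function names are hypothetical, and linear interpolation is a simplification; a production bridge may prefer a filtered resampler.

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def upsample_3x(samples: list) -> list:
    """Triple the sample rate by linear interpolation between neighbours.
    The final sample is held, so output length is exactly 3x the input."""
    out = []
    for i, cur in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        step = nxt - cur
        out.extend((cur, cur + step // 3, cur + (2 * step) // 3))
    return out


def ulaw_8k_to_pcm16_24k(frame: bytes) -> bytes:
    """Decode a mu-law frame and return little-endian PCM16 at 24 kHz."""
    pcm_24k = upsample_3x([ulaw_byte_to_pcm16(b) for b in frame])
    return struct.pack(f"<{len(pcm_24k)}h", *pcm_24k)
```

A 160-byte frame (20 ms at 8 kHz) becomes 960 bytes of PCM16, ready to base64-encode.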
`whisper-1`.

→ see `examples/whatsapp-voice-notes.py` (to be added)

See `voice-translate-live` if multilingual translation is needed.

→ see `examples/meeting-minutes.py` (to be added)
| Property | Value |
|---|---|
| Pricing | $0.017 / minute of audio duration |
| Context window | 16,000 tokens |
| Max output | 2,000 tokens |
| Billing unit | Audio duration (not tokens) |
| Rate limits | [UNCONFIRMED] — see OpenAI platform docs |
Worked examples (60-min support call, 200-hour/day call center, 8-hour live event, rotated long session) → references/pricing-limits.md.
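Since billing is by audio duration, a first-order estimate is a single multiplication. The sketch assumes the $0.017 rate applies linearly per minute with no rounding, minimums, or volume tiers; none of that is confirmed here, and `references/pricing-limits.md` remains the authority for the worked figures.

```python
from decimal import Decimal

PRICE_PER_MINUTE_USD = Decimal("0.017")


def audio_cost_usd(minutes) -> Decimal:
    """Estimated cost for the given minutes of audio, using exact decimals."""
    return PRICE_PER_MINUTE_USD * Decimal(str(minutes))
```

Under that assumption, a 60-minute call comes to $1.02, an 8-hour event to $8.16, and 200 hours of audio to $204.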
- There is no response loop: no `response.create`, no `response.output_*` events. If you wrote code that listens for `response.output_audio_transcript.delta`, you're on the wrong skill; that's `voice-agent-realtime`. The transcription events are `conversation.item.input_audio_transcription.delta` / `.completed`.
- If you append each `.delta` chunk, mid-segment corrections produce garbled doubled text. Render each `item_id` as a single replaceable block until `.completed`.
- Long sessions can exhaust the 16K context window; see `## Long-session strategy`.
- [UNCONFIRMED] whether telephony's 8 kHz μ-law (`audio/pcmu`) is accepted directly; until probed, resample on ingress.
- Diarization support is [UNCONFIRMED]. The model card and transcription guide did not mention speaker diarization for `gpt-realtime-whisper`. If you need speaker labels, evaluate `gpt-4o-transcribe-diarize` or layer a separate diarizer on the audio.
- The translation endpoint uses `session.input_audio_buffer.append` (prefixed); this skill uses the unprefixed `input_audio_buffer.append`. Don't copy-paste between adapters without checking.
- `session.created`, `session.updated`, `input_audio_buffer.speech_started/stopped/committed`, `error` are [UNCONFIRMED] for this endpoint: likely present (shared session vocabulary with the conversational endpoint) but not contractually documented. Probe before depending on them.

Long-form references (this dispatch):
- `references/streaming-deltas.md`: full event flow, payload shapes, partial vs final, UX patterns for live captions, corrections / rollback, cross-reference to the conversational endpoint events catalogue.
- `references/latency-tuning.md`: per-profile config deep-dive (ultra-low / balanced / quality-max), trade-offs, measurement protocol, decision tree, network considerations.
- `references/pricing-limits.md`: full pricing table, 4 worked cost examples (60-min support call, 200h/day call center, 8h live event, rotated 1h session), context limits, rate limits.

Examples (to be added):

- `examples/live-captions.tsx`: browser WebSocket client with speculative-delta UI for live captions overlay.
- `examples/twilio-transcribe.py`: Twilio Media Streams ↔ OpenAI transcription bridge with VAD-driven CRM logging.
- `examples/whatsapp-voice-notes.py`: WhatsApp Cloud API voice-note streaming transcription.
- `examples/meeting-minutes.py`: Recall.ai meeting-bot transcription with 10-minute session rotation.

Scripts (to be added):

- `scripts/test-latency-vs-quality.py`: live measurement harness that compares the three latency profiles against a reference clip and reports first-delta / final-completed latencies plus WER.

Canonical source for every fact in this skill: `/Users/youssef/AgentVocalOpco/docs/research/openai-realtime-api-2026-05-11.md`.
npx claudepluginhub generovo/claude-voice-skills --plugin claude-voice-skills