Beyond simple API requests: How OpenAI’s WebSocket mode changes the game for low latency voice-driven AI experiences


In the world of generative AI, latency is the ultimate killer of immersion. Until recently, building a voice-enabled AI agent meant assembling a Rube Goldberg machine: you'd pipe the audio into a speech-to-text (STT) model, send the transcript to a large language model (LLM), and finally feed the text to a text-to-speech (TTS) engine. Each hop added hundreds of milliseconds of lag.

OpenAI has dismantled this stack with the Realtime API, a dedicated offering whose WebSocket mode provides a straightforward, consistent pipe into GPT-4o's native multimodal capabilities. This represents a fundamental shift from stateless request-response cycles to stateful, event-driven streaming.

Protocol Shift: Why WebSocket?

The industry has long relied on standard HTTP POST requests. While streaming text over Server-Sent Events (SSE) made LLM responses feel faster, SSE is still a one-way street. The Realtime API instead uses the WebSocket protocol (wss://), providing a full-duplex communication channel.

For developers building voice assistants, this means the model can 'listen' and 'talk' simultaneously over the same connection. To connect, developers point their client at:

wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview
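A minimal connection sketch in Python, assuming the third-party `websockets` package and an `OPENAI_API_KEY` environment variable; the `OpenAI-Beta: realtime=v1` header was required during the preview period, so check the current docs before relying on it:

```python
import asyncio
import os

REALTIME_BASE = "wss://api.openai.com/v1/realtime"

def realtime_url(model: str) -> str:
    """Build the Realtime API endpoint URL for a given model."""
    return f"{REALTIME_BASE}?model={model}"

async def connect():
    import websockets  # third-party: pip install websockets

    # The Realtime API authenticates via headers rather than the URL.
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        realtime_url("gpt-4o-realtime-preview"),
        additional_headers=headers,  # named extra_headers in websockets < 14
    ) as ws:
        # The first server event on a fresh connection is session.created.
        print("Connected:", await ws.recv())

# To run: asyncio.run(connect())
```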

Core Architecture: Sessions, Responses and Items

Understanding the Realtime API requires mastering three entities:

  • Session: The global configuration. Through a session.update event, engineers define the system prompt, voice (for example, Alloy, Ash, or Coral), and audio formats.
  • Item: Each conversation element (the user's speech, the model's output, or a tool call) is an item stored in the server-side conversation state.
  • Response: The order to act. Sending a response.create event tells the server to examine the conversation state and generate a reply.
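The two client-to-server events above can be sketched as JSON payloads. This is a minimal illustration; the session field names (`instructions`, `voice`, `input_audio_format`, `output_audio_format`) follow OpenAI's published Realtime API reference, but verify them against the current docs:

```python
import json

def session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a session.update event configuring prompt, voice, and audio formats."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
        },
    })

def response_create() -> str:
    """Ask the server to read the conversation state and generate a reply."""
    return json.dumps({"type": "response.create"})
```

Both strings are sent over the open WebSocket with a plain `ws.send(...)`.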

Audio Engineering: PCM16 and G.711

OpenAI's WebSocket mode operates on raw audio frames encoded as Base64. It supports two primary formats:

  • PCM16: 16-bit pulse code modulation at 24kHz (ideal for high-fidelity apps).
  • G.711: 8kHz telephony standards (μ-law and A-law), ideal for VoIP and SIP integration.

Developers stream audio in small chunks (typically 20–100 ms) via input_audio_buffer.append events. The model then streams back response.output_audio.delta events for instant playback.
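The chunking and Base64 encoding can be sketched as follows; the 40 ms chunk size is an arbitrary choice within the typical range, not a requirement:

```python
import base64

SAMPLE_RATE = 24_000   # PCM16 at 24 kHz
BYTES_PER_SAMPLE = 2   # 16-bit samples

def chunk_pcm16(pcm: bytes, chunk_ms: int = 40) -> list[bytes]:
    """Split raw PCM16 audio into fixed-duration chunks."""
    step = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]

def append_event(chunk: bytes) -> dict:
    """Wrap one chunk as an input_audio_buffer.append event (Base64-encoded)."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    }
```

Each dict is serialized to JSON and sent over the socket as the microphone produces audio.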

VAD: from silence to semantics

A major update is the expansion of Voice Activity Detection (VAD). While the standard server_vad relies on a silence threshold, the new semantic_vad uses a classifier to judge whether the user has truly finished or is merely pausing to think. This prevents the AI from awkwardly interrupting the user mid-sentence, a common 'uncanny valley' issue in earlier voice AI.
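Both modes are selected via the turn_detection block of a session.update event. A sketch of the two configurations; the `silence_duration_ms` field name follows OpenAI's documented server_vad options, but treat the exact fields as subject to change:

```python
# Silence-threshold VAD: the server replies after ~500 ms of silence.
server_vad_config = {
    "turn_detection": {
        "type": "server_vad",
        "silence_duration_ms": 500,
    }
}

# Semantic VAD: a classifier decides whether the user has actually finished,
# rather than triggering on any sufficiently long pause.
semantic_vad_config = {
    "turn_detection": {
        "type": "semantic_vad",
    }
}
```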

Event-driven workflow

Working with WebSockets is inherently asynchronous. Instead of waiting for a single response, you listen for a stream of server events:

  • input_audio_buffer.speech_started: The model has detected that the user is speaking.
  • response.output_audio.delta: The audio snippet is ready to play.
  • response.output_audio_transcript.delta: Text transcripts arrive in real time.
  • conversation.item.truncate: This is used when a user intervenes, allowing the client to tell the server where to “cut” the model’s memory to match what the user actually heard.
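A minimal dispatcher for the events above might look like this. It is a sketch: the barge-in handling simply drops queued audio when speech is detected, while a real client would also send conversation.item.truncate back to the server:

```python
import base64
import json

def handle_event(raw: str, audio_out: list, transcript: list) -> str:
    """Dispatch one server event; returns the event type for logging."""
    event = json.loads(raw)
    etype = event["type"]
    if etype == "response.output_audio.delta":
        # Decode the Base64 audio snippet and queue it for playback.
        audio_out.append(base64.b64decode(event["delta"]))
    elif etype == "response.output_audio_transcript.delta":
        # Accumulate live captions as they stream in.
        transcript.append(event["delta"])
    elif etype == "input_audio_buffer.speech_started":
        # User barge-in: drop audio that has not been played yet.
        audio_out.clear()
    return etype
```

In practice this runs inside an `async for raw in ws:` loop over the open connection.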

Key takeaways

  • Full-duplex, stateful communication: Unlike traditional stateless REST APIs, the WebSocket protocol (wss://) enables a continuous, bidirectional connection. This allows simultaneous 'listening' and 'speaking', while the live session keeps conversation state on the server, eliminating the need to re-send the entire conversation history on each turn.
  • Native multimodal processing: The API bypasses the STT → LLM → TTS pipeline. By processing audio natively, GPT-4o reduces latency and can perceive and generate paralinguistic features such as tone, emotion, and inflection, which are usually lost in text transcription.
  • Granular event control: The architecture relies on specific events for real-time interaction. Key events include input_audio_buffer.append to stream audio chunks to the model and response.output_audio.delta to receive audio snippets, allowing instant, low-latency playback.
  • Advanced Voice Activity Detection (VAD): The transition from simple, silence-based server_vad to semantic_vad allows the model to distinguish between a user pausing for thought and a user finishing their sentence. This prevents awkward interruptions and creates a more natural conversation flow.



Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.
