Google has just announced Gemini 3.5 Live Translate, its latest audio model for live speech-to-speech translation. Speech-to-speech means spoken audio goes in and translated spoken audio comes out. The model automatically detects over 70 languages and generates translated speech that preserves the speaker’s intonation, pace, and pitch. Turn-by-turn systems wait for the speaker to finish before responding. Gemini 3.5 Live Translate instead generates speech continuously, balancing two goals: waiting for more context, which improves quality, and translating immediately, which keeps the output in sync with the speaker. As a result, the translation stays a few seconds behind the speaker throughout the session.
Gemini 3.5 Live Translate
Gemini 3.5 Live Translate is a dedicated audio model (`gemini-3.5-live-translate-preview`), not a chat assistant. It processes speech as a continuous audio stream rather than as complete sentences. It handles multilingual input without manual configuration. Its noise robustness lets applications run in fast-moving, unpredictable environments.
The model operates on three surfaces. Developers get it in public preview through the Gemini Live API and Google AI Studio. Enterprises will get a private preview in Google Meet starting this month. Everyone else gets it through the Google Translate app on Android and iOS.
How does continuous streaming work?
The design difference matters for building real-time features. A conversational live agent uses turn-based interactions. It depends on pause detection, intent detection, and interruption handling. Live Translate uses continuous stream processing instead. It translates as the speaker talks, without waiting for the speaker to finish.
To maintain strict real-time latency limits, the translation path only accepts audio input. Text input is not supported in translate mode. The model also skips tool usage and system instructions in this mode. This gives it a focused translation pipeline rather than a general agent.
Build with Live API
Developers configure translation in the Live API session setup. You set a `translationConfig` block within `generationConfig`. The `targetLanguageCode` field takes a BCP-47 code, such as `"pl"` or `"es"`. BCP-47 is the standard format for language tags such as `en` or `pt-BR`. The default is `"en"`. The `echoTargetLanguage` boolean controls what happens to input that is already in the target language. When `true`, the model echoes that speech. When `false`, it stays silent. You can also enable `inputAudioTranscription` and `outputAudioTranscription` for text transcripts.
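As a rough illustration of the fields described above, here is a minimal sketch that assembles a session setup payload as a plain Python dictionary. Only the `translationConfig` block inside `generationConfig`, its two fields, and the model name come from the announcement. The outer `setup` and `model` keys, the `models/` prefix, and the placement of the two transcription fields are assumptions, and `build_translate_setup` is a hypothetical helper, so check the Live API reference before relying on this shape.

```python
import json

MODEL_ID = "gemini-3.5-live-translate-preview"  # model name given in the article


def build_translate_setup(target_language_code="en",
                          echo_target_language=False,
                          transcribe=False):
    """Return a session-setup payload as a plain dict (hypothetical helper).

    translationConfig nested in generationConfig follows the article.
    The outer "setup"/"model" keys and the position of the transcription
    fields are assumptions, not confirmed wire format.
    """
    setup = {
        "model": f"models/{MODEL_ID}",  # "models/" prefix is an assumption
        "generationConfig": {
            "translationConfig": {
                "targetLanguageCode": target_language_code,  # BCP-47 tag
                "echoTargetLanguage": echo_target_language,
            }
        },
    }
    if transcribe:
        # Assumed to be enabled with empty objects at the setup level.
        setup["inputAudioTranscription"] = {}
        setup["outputAudioTranscription"] = {}
    return {"setup": setup}


if __name__ == "__main__":
    # Translate into Brazilian Portuguese and request text transcripts.
    print(json.dumps(build_translate_setup("pt-BR", transcribe=True), indent=2))
```

Keeping the payload as a plain dictionary makes it easy to serialize for a WebSocket session or to map onto an SDK's config object, whichever the application uses.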
Audio formats are fixed. The input is raw 16-bit PCM at 16 kHz, mono, little-endian. The output is raw 16-bit PCM at 24 kHz, mono, little-endian. PCM is uncompressed raw audio. You send the audio in chunks of 100 ms. For client-side apps, short-lived tokens on the `v1alpha` endpoints avoid exposing your API key.
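The chunk sizes implied by these formats are simple arithmetic: 16-bit mono audio uses 2 bytes per sample, so 100 ms at 16 kHz is 1,600 samples, or 3,200 bytes. The sketch below computes that and splits a raw PCM buffer into 100 ms pieces. The helper names are illustrative, and how the chunks are framed on the wire is outside what the article describes.

```python
def bytes_per_chunk(sample_rate_hz, chunk_ms=100, bytes_per_sample=2, channels=1):
    """Size in bytes of one chunk of raw PCM audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def iter_chunks(pcm, chunk_bytes):
    """Yield consecutive slices of `pcm`; the final slice may be shorter."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


INPUT_CHUNK = bytes_per_chunk(16_000)   # 1,600 samples * 2 bytes = 3,200 bytes
OUTPUT_CHUNK = bytes_per_chunk(24_000)  # 2,400 samples * 2 bytes = 4,800 bytes

if __name__ == "__main__":
    one_second_of_silence = bytes(16_000 * 2)  # 16-bit mono at 16 kHz
    chunks = list(iter_chunks(one_second_of_silence, INPUT_CHUNK))
    print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

The same calculation gives 4,800 bytes for every 100 ms of 24 kHz output, which is useful when sizing playback buffers.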
| Dimension | Live agent | Live Translate |
|---|---|---|
| Model role | Assistant that listens, reasons, and acts | Interpreter / real-time translation pipeline |
| Interaction | Turn-based, with interruption handling | Continuous stream processing, no turns |
| Tools | Function calling, Google Search, system instructions | Translation only, no tools or system instructions |
| Input | Text, audio, video, and images | Audio only, for tight latency |
| Configuration | Generation, speech, tools, instructions | `targetLanguageCode` and `echoTargetLanguage` |
Use cases
The model targets live interpretation in multiple settings. Google lists multilingual calls, meetings, texts, and broadcasts. Developer platforms ease the integration work for real-time media. Agora, FishJam, LiveKit, Pipecat, and Vision Agent already use the Live API. These platforms handle complex real-time media streaming infrastructure. This allows developers to focus on user experience.
Google’s example app demonstrates dubbing and simultaneous multilingual translation. Grab is testing the model for driver-and-passenger communication on pickups. Grab users make more than 10 million voice calls per month. CJ ENM, LiveKit and others gave positive feedback on quality, accuracy and low latency.
How it changes Google Meet and Translate
According to Google’s official release, Google Meet will soon use Gemini 3.5 Live Translate for speech translation. The table below compares Meet before and after the update.
| Capability | Previous Meet | Meet with Gemini 3.5 Live Translate |
|---|---|---|
| Languages | 5 | 70+ |
| Language combinations per meeting | To and from English only | 2,000+ combinations |
| Access | Existing interface | Updated interface for quicker access |
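The "2,000+" figure is consistent with simple pair counting. The sketch below assumes exactly 70 languages and treats a combination as an unordered pair of distinct languages. How Google actually counts combinations is not stated, so this is only a sanity check.

```python
from math import comb


def unordered_pairs(n_languages):
    """Number of distinct language pairs, ignoring direction."""
    return comb(n_languages, 2)


def ordered_pairs(n_languages):
    """Number of source-to-target directions between distinct languages."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    print(unordered_pairs(70))  # prints 2415
    print(ordered_pairs(70))    # prints 4830
```

With 70 languages there are 2,415 unordered pairs, which fits the "2,000+" claim. Counting each direction separately would give 4,830.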
The Meet update is in private preview for select Google Workspace business customers this month. A broader rollout will follow later this year. In the Translate app, the live translation feature works with any connected headphones. It preserves the speaker’s voice in over 70 languages. Android also gets a listening mode. You hold the phone to your ear as on a regular call. The translated audio then plays through the earpiece, so others cannot hear it.
Key takeaways
- Gemini 3.5 Live Translate is Google’s latest audio model for live speech-to-speech translation in 70+ languages.
- It streams continuously instead of taking turns, staying a few seconds behind the speaker.
- Developers can configure it through the Live API using `targetLanguageCode` and `echoTargetLanguage`. It is audio only: 16 kHz in, 24 kHz out.
- It is rolling out across the Gemini Live API, Google Meet (5→70+ languages), and the Translate app.
- All generated audio carries an imperceptible SynthID watermark for traceability.