Google AI has released TranslateGemma, a suite of open machine translation models built on Gemma 3 and targeting 55 languages. The family comes in 4B, 12B and 27B parameter sizes and is designed to run across a range of hardware, from mobile and edge devices to laptops and a single H100 GPU or TPU instance in the cloud.
TranslateGemma is not a separate architecture. It is Gemma 3 specialized for translation through a two-stage post-training pipeline: (1) supervised fine tuning on large parallel corpora, and (2) reinforcement learning that optimizes translation quality with a multi-signal reward ensemble. The goal is to improve translation quality while retaining the general instruction-following behavior of Gemma 3.
Supervised fine tuning on synthetic and human parallel data
The supervised fine tuning phase begins with the public Gemma 3 checkpoints at 4B, 12B and 27B. The research team uses parallel data that combines human translations with high-quality synthetic translations generated by Gemini.
Synthetic data is drawn from monolingual sources through a multi-step pipeline: it selects candidate sentences and short documents, translates them with Gemini 2.5 Flash, and then filters the output with MetricX-24 QE to keep only examples that show clear quality gains. This process is applied to all WMT24++ language pairs and 30 additional language pairs.
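The filtering logic is simple to express in code. The sketch below is an illustrative reconstruction rather than the authors' pipeline: the caller supplies `translate` (e.g. a Gemini 2.5 Flash wrapper) and `qe_score` (e.g. a MetricX-24 QE wrapper), and the quality threshold is an assumed value.

```python
# Illustrative sketch of the synthetic-data pipeline described above. The caller
# supplies `translate` (e.g. a Gemini 2.5 Flash wrapper) and `qe_score` (e.g. a
# MetricX-24 QE wrapper, lower is better); both are assumptions, not real APIs.
def build_synthetic_pairs(monolingual_texts, target_lang, translate, qe_score,
                          quality_threshold=2.0):
    pairs = []
    for source in monolingual_texts:
        hypothesis = translate(source, target_lang)             # candidate synthetic translation
        if qe_score(source, hypothesis) <= quality_threshold:   # keep only clearly high-quality examples
            pairs.append({"source": source, "target": hypothesis})
    return pairs
```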
Low-resource languages obtain human-generated parallel data from the SMOL and GATITOS datasets. SMOL covers 123 languages and GATITOS covers 170 languages. This improves coverage of scripts and language families that are underrepresented in publicly available web parallel data.
The final supervised fine tuning mixture also keeps 30 percent general instruction-following data from the original Gemma 3 blend. This matters: without it, the model would over-specialize on pure translation and lose common LLM behaviors such as following instructions or doing simple reasoning in context.
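As a rough illustration of that mixture, here is a minimal sampling sketch; the per-example sampling scheme and data structures are assumptions, and only the 30 percent proportion comes from the paper.

```python
import random

# Minimal sketch of a 70/30 data mixture: roughly 70% translation examples and
# 30% general instruction-following examples from the original Gemma 3 blend.
def sample_training_example(translation_data, instruction_data):
    if random.random() < 0.7:
        return random.choice(translation_data)
    return random.choice(instruction_data)
```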
Training uses the Kauldron SFT (supervised fine tuning) tooling with the Adafactor optimizer, a learning rate of 1e-4, a batch size of 64, and 200,000 steps. All model parameters are updated except the token embeddings, which are frozen. Freezing the embeddings helps preserve representation quality for languages and scripts that do not appear in the supervised fine tuning data.
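The frozen-embedding recipe is easy to mirror in other frameworks. The sketch below uses PyTorch and Hugging Face Transformers rather than the authors' Kauldron setup; the checkpoint id and the Adafactor settings beyond the learning rate are assumptions.

```python
import torch
from transformers import Adafactor, Gemma3ForConditionalGeneration

# Start from a public Gemma 3 checkpoint (id shown for illustration).
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it", torch_dtype=torch.bfloat16
)

# Freeze the token embeddings so scripts unseen in the fine tuning data
# keep their pretrained representations.
for param in model.get_input_embeddings().parameters():
    param.requires_grad_(False)

# Update everything else with Adafactor at the reported learning rate of 1e-4.
optimizer = Adafactor(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
    scale_parameter=False,
    relative_step=False,
)
```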
Reinforcement learning with a translation-focused reward ensemble
After supervised fine tuning, TranslateGemma runs a reinforcement learning stage on top of the same translation data mixture. Instead of relying on a single scalar reward, this stage draws on an ensemble of reward models.
The reward ensemble includes:
- MetricX-24 XXL QE, a learned regression metric that estimates MQM scores, used here in quality estimation mode without any reference.
- Gemma AutoMQM QE, a span-level error predictor fine-tuned from Gemma 3 27B IT on MQM-labeled data. It produces token-level rewards based on error type and severity.
- ChrF, a character n-gram overlap metric that compares model outputs against synthetic references, rescaled to be comparable with the other reward signals.
- A naturalness autorater that uses the policy model as an LLM judge and produces span-level penalties for segments that do not read like natural, originally written text.
- A generalist reward model from the Gemma 3 post-training setup that preserves reasoning and instruction-following ability.
The reinforcement learning algorithm combines sequence-level rewards with token-level rewards. Span-level rewards from AutoMQM and the naturalness autorater are attached directly to the affected tokens. These token-level rewards are added to the sequence-level rewards computed from the ensemble, and the combined signal is batch normalized. This improves credit assignment compared to pure sequence-level reinforcement learning.
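In pseudocode, that credit-assignment step might look like the sketch below; the array shapes and the simple additive combination are assumptions based on the description above, not the exact algorithm from the paper.

```python
import numpy as np

# Sketch: combine sequence-level ensemble rewards with token-level span rewards,
# then batch-normalize the result into a per-token learning signal.
def combine_rewards(sequence_rewards, token_rewards):
    # sequence_rewards: shape (batch,)          one ensemble score per translation
    # token_rewards:    shape (batch, seq_len)  span-level credits from AutoMQM / the naturalness autorater
    per_token = token_rewards + sequence_rewards[:, None]  # broadcast sequence reward onto every token
    return (per_token - per_token.mean()) / (per_token.std() + 1e-6)
```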
Benchmark results on WMT24++
TranslateGemma is evaluated on the WMT24++ benchmark using MetricX-24 and COMET22. For MetricX, lower is better, and the score correlates with MQM error counts. For COMET22, higher is better, and the score measures adequacy and fluency.

The table above, from the research paper, summarizes the results of the English-centric evaluation across 55 language pairs.
- 27B: the Gemma 3 baseline scores MetricX 4.04 and COMET22 83.1, while TranslateGemma 27B reaches MetricX 3.09 and COMET22 84.4.
- 12B: the Gemma 3 baseline scores MetricX 4.86 and COMET22 81.6, while TranslateGemma 12B reaches MetricX 3.60 and COMET22 83.5.
- 4B: the Gemma 3 baseline scores MetricX 6.97 and COMET22 77.2, while TranslateGemma 4B reaches MetricX 5.32 and COMET22 80.1.
The main pattern is that TranslateGemma improves quality at every model size. In addition, model scale interacts with specialization: the 12B TranslateGemma model surpasses the 27B Gemma 3 baseline, and the 4B TranslateGemma model reaches roughly the same quality as the 12B Gemma 3 baseline. This means a smaller translation-specialized model can replace a larger baseline model for many machine translation workloads.


The language-level analysis in the appendix table of the research paper shows that these gains appear across all 55 language pairs. For example, MetricX improves from 1.63 to 1.19 for English to German, from 2.54 to 1.88 for English to Spanish, from 3.90 to 2.72 for English to Hebrew, and from 5.92 to 4.45 for English to Swahili. The improvements remain large even for difficult directions such as English to Lithuanian, English to Estonian, and English to Icelandic.
Human evaluation on WMT25 with MQM confirms this trend. TranslateGemma 27B generally achieves lower MQM scores, meaning fewer weighted errors, than Gemma 3 27B, with a particularly strong advantage in low-resource directions such as English to Marathi, English to Swahili, and Czech to Ukrainian. There are two notable exceptions: the two systems are close to parity for German as a target, and TranslateGemma shows a regression on Japanese to English, driven primarily by named entity errors, even though the other error categories improve.
Multimodal translation and interface for developers
TranslateGemma inherits the image understanding stack of Gemma 3. The research team evaluates image translation on the Vistra benchmark, selecting 264 images that each contain a text sample. The model receives only the image and a prompt asking it to translate the text in the image; there is no separate bounding box input and no explicit OCR step.
In this setting, TranslateGemma 27B improves MetricX from 2.03 to 1.58 and COMET22 from 76.1 to 77.7. The 4B variant shows a small but positive gain. The 12B model improves on MetricX but has a slightly lower COMET22 score than the baseline. Overall, the research team concludes that TranslateGemma retains the multimodal capability of Gemma 3 and that the text translation improvements largely carry over to image translation.
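For developers, this image-translation setup maps onto the standard Gemma 3 multimodal chat interface in Hugging Face Transformers. The sketch below follows that generic Gemma 3 usage pattern; the TranslateGemma model id, the image URL, and the prompt wording are assumptions, not an official example.

```python
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

# Hypothetical model id; check Hugging Face for the official TranslateGemma weights.
model_id = "google/translategemma-27b-it"

processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# One image plus a plain translation prompt: no bounding boxes, no explicit OCR step.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://example.com/street-sign.jpg"},  # placeholder image URL
        {"type": "text", "text": "Translate the text in this image into German."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```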
Key takeaways
- TranslateGemma is a translation-specialized version of Gemma 3: TranslateGemma is a suite of open translation models derived from Gemma 3, available in 4B, 12B and 27B parameter sizes and optimized for 55 languages through a two-stage pipeline of supervised fine tuning and reinforcement learning with translation-focused rewards.
- Training combines Gemini-generated synthetic data with human parallel corpora: The model is fine-tuned on a mix of high-quality synthetic parallel data generated by Gemini and human-translated data, improving coverage for both high-resource and low-resource languages while preserving general LLM capabilities from Gemma 3.
- Reinforcement learning uses an ensemble of quality estimation rewards: After supervised fine tuning, TranslateGemma applies reinforcement learning driven by a suite of reward models, including MetricX QE and AutoMQM, that explicitly targets translation quality and fluency rather than general chat behavior.
- Smaller models match or beat larger Gemma 3 baselines on WMT24++: Across 55 languages, every TranslateGemma size shows consistent improvements over Gemma 3, with the 12B model surpassing the 27B Gemma 3 baseline and the 4B model reaching quality comparable to the 12B baseline, reducing compute requirements for a given translation quality level.
- Models retain multimodal capabilities and are released as open weights: TranslateGemma keeps Gemma 3's ability to translate text in images, improves performance on the Vistra image translation benchmark, and its weights are released as open models on Hugging Face and Vertex AI, enabling local and cloud deployment.
Check out the paper, model weights, and technical details.
